UNCLASSIFIED 


AD  NUMBER 


ADB200462 


NEW  LIMITATION  CHANGE 
TO 

Approved  for  public  release,  distribution 
unlimited 


FROM 

Distribution  authorized  to  DoD  only; 
Administrative/Operational  Use;  Oct  1950. 
Other  requests  shall  be  referred  to  USAF 
School of Aviation Medicine, Randolph AFB,
TX. 


AUTHORITY 


DTIC  Form  55,  dtd  November  27,  2001, 
Control  No.  1036003. 


THIS  PAGE  IS  UNCLASSIFIED 



UNITED 

STATES 

AIR 

FORCE 



SCHOOL OF

AVIATION 

MEDICINE 


DISCRIMINATORY  ANALYSIS 
I. SURVEY OF DISCRIMINATORY ANALYSIS




PROJECT  NUMBER  21-49-004 
REPORT  NUMBER  I 

(Formerly  Project  Number  21-02-105) 




PROJECT REPORT




DISCRIMINATORY  ANALYSIS 

I.  SURVEY  OF  DISCRIMINATORY  ANALYSIS 


Joseph  L.  Hodges,  Jr.,  Ph.D. 

Associate  Professor  of  Mathematical  Statistics 
Statistical  Laboratory 
University  of  California,  Berkeley 


PROJECT  NUMBER  21-49-004 
REPORT  NUMBER  I* 

(Formerly  Project  Number  21-02-105) 

*  A  University  of  California  report  under 
Contract No. AF41(128)-8




USAF  SCHOOL  OF  AVIATION  MEDICINE 

RANDOLPH  FIELD,  TEXAS 
OCTOBER  1950 


Precis 


OBJECT:

To survey, in as nontechnical a manner as possible, the
extensive literature on discriminatory analysis and related
topics.


SUMMARY:

The literature on discriminatory analysis and related
topics is reviewed. A bibliography of over 250 references
is appended. Mathematical research projects are suggested
in relation to the medical and psychological problems of
Air Force selection and classification programs.


CHAPTER  I 


Introduction. 

The  purpose  of  the  present  monograph  is  to  survey,  in  as 
nontechnical  a  manner  as  possible,  the  extensive  literature 
on  discriminatory  analysis  and  related  topics  which  is  listed 
in the bibliography (pages 89-115). It seems desirable to
indicate  briefly  the  point  of  view  from  which  the  topics  were 
selected  and  discussed. 

In a narrow sense discriminatory analysis may be identified
with the finite multiple classification problem: an
individual I is known to belong to just one of k specified
categories or populations, and must be classified into one of
these populations on the basis of whatever evidence is
available about I and about the populations. The
classification problem becomes statistical when we further
specify that the available evidence about I consists of
observed values of certain random variables, these random
variables having different probability distributions in the
different populations.

It did not seem reasonable, however, to place so strict
an interpretation on the subject in preparing the present
survey. The techniques employed in discriminatory analysis
are intimately related to certain techniques, especially the
coefficient of racial likeness and the generalized distance,
which were introduced earlier, and it was not possible to
convey an adequate idea of the development of discriminatory
techniques without first discussing their predecessors. We
have therefore devoted Chapter II to the coefficient of
racial likeness and Chapter III to the generalized distance.
Extensive bibliographical listings are also given for these
topics.

Until recently discriminatory analysis has been essentially
no more than the application of the linear discriminant
function. Correspondingly, a central place has been given to
this topic. The discriminant function is introduced in
Chapter V; in Chapter VI there is presented in tabular form a
collection of its applications to many scientific fields; and
in Chapter VII some of its modifications and extensions are
discussed.

The entire topic of multivariate analysis may be regarded
as an extension of the discriminant function, but it did
not seem reasonable to include in the present work a
discussion of multivariate analysis. We have restricted
ourselves to a brief indication of the connections between the
two topics, given mostly in Chapters IV and VII.

In his invited address at the meeting of the Institute
of Mathematical Statistics in Berkeley, California, June 16,
1949, Professor M. A. Girshick pointed out that the
development of discriminatory analysis reflects the same broad
phases as does the general history of statistical inference.
We may distinguish a Pearsonian stage, connected with the
coefficient of racial likeness, followed by a Fisherian stage,
connected with the linear discriminant function. Girshick
further notes a Neyman-Pearson stage and a contemporary
Waldian stage, which are discussed here in Chapters VIII
and IX, respectively. These stages are marked by the
introduction of the notions of probability of
misclassification, and of risk.

As is indicated by the fact that the bibliography contains
over 250 listings, it was impossible to give a thorough
discussion to all of the literature. In making the selection
of the papers to be presented at length, two principles have
been followed. We have tried to present in some detail the
ideas which marked important conceptual advances, rather than
those which correspond to technical elaborations. And, other
things being equal, we have preferred the simpler topics to
the more complicated ones. This preference was of course
dictated by the desire to have the monograph accessible to
persons of limited training in mathematical statistics.

The bibliography was compiled by scanning recent volumes
of the main statistical journals, by consulting bibliographic
reference works such as Mathematical Reviews, Educational
Index, Statistical Methodology Index, Psychological Abstracts,
and Biological Abstracts, and by tracing back the
bibliographic references in the papers themselves. Much of
this work was done by the assistants, and I have particularly
to thank Mr. Charles Kraft for doing most of the final
checking for accuracy. We tried to make the bibliography as
complete as possible, and would appreciate having omissions
brought to our attention. References in the text to the
bibliography are made by giving author's name and date. A
list of periodicals is given at the end of the report.

In conclusion I should like to thank my friends and
colleagues, Dr. Evelyn Fix and Professor E. L. Lehmann, who
have gone through much of the manuscript and have made many
constructive changes. Our thanks are also due to the various
scholars who have made available to us their unpublished
manuscripts; in particular we thank T. W. Anderson, Z. W.
Birnbaum, G. W. Brown, D. G. Chapman, C. L. Chiang, and M. A.
Girshick.


CHAPTER  II 


The Coefficient of Racial Likeness.

Karl Pearson and his colleagues at University College,
London, were deeply interested in the possibility that human
crania might be used in the study of anthropology and
evolution. They formed considerable collections of skulls,
which were carefully measured and studied. Frequently the
samples were quite small, so that it often became desirable
to pool closely related samples. Hence there was need for a
test of the significance of observed differences between the
samples, which could be applied to determine whether such
pooling would be appropriate. There were available tests for
the significance of difference of two normal samples, in which
each observation consisted of a single measurement, but in
craniometric  work  it  was  usual  to  measure  as  many  as  $0  quanti¬ 
ties  on  each  skull.  As  Pearson  saw,  there  was  need  for  a 
test  which  would  compensate  for  the  smallness  of  the  samples 
by  the  large  number  of  quantities  which  might  be  measured  on 
each  individual. 

As Pearson wrote later, he tackled this problem in 1919.
The solution which he obtained was published in 1921, in a
paper written by Miss M. L. Tildesley. Miss Tildesley wanted
to know whether she should combine two small samples of
Burmese skulls so that the resulting larger sample could be
used to give a more reliable estimate of Burmese cranial
characteristics. To answer this question, she used the
coefficient of racial likeness (which we shall hereafter
denote by CRL). The CRL was given a number of slightly
differing definitions but in a simple situation it might be
defined as follows.

Suppose we have two samples, say a sample $S_1$ of $n_1$
individuals (skulls), and a sample $S_2$ of $n_2$ individuals.
Suppose that on each individual of each sample we measure p
traits. Denote the value of the ith trait measured on the
jth individual of the ath sample by $x_{aij}$. From these
measurements we compute for each sample and each trait the
mean and standard deviation:

(1)  $\bar{x}_{ai} = \frac{1}{n_a} \sum_{j=1}^{n_a} x_{aij}, \qquad s_{ai}^2 = \frac{1}{n_a} \sum_{j=1}^{n_a} (x_{aij} - \bar{x}_{ai})^2 .$

Pearson then would define the CRL to be the quantity

(2)  $\frac{1}{p} \sum_{i=1}^{p} \frac{(\bar{x}_{1i} - \bar{x}_{2i})^2}{s_{1i}^2/n_1 + s_{2i}^2/n_2} - 1 .$


The motivation of Pearson's definition is approximately
as follows. If the two samples do come from the same
population, the expected value of $\bar{x}_{1i} - \bar{x}_{2i}$
is 0; and in any case an estimate of the variance of
$\bar{x}_{1i} - \bar{x}_{2i}$ is given by
$s_{1i}^2/n_1 + s_{2i}^2/n_2$. Since biological measurements
are often approximately normally distributed, and since
arithmetic means tend to be nearly normal even if the averaged
quantities are not normal, we may think of

$\frac{\bar{x}_{1i} - \bar{x}_{2i}}{\sqrt{s_{1i}^2/n_1 + s_{2i}^2/n_2}}, \qquad i = 1, 2, \ldots, p$

as being, approximately, normal random variables of unit
variance, whose expected values are 0 if the two samples come
from the same population, but whose expectations would usually
differ from 0 otherwise. Now if the p random variables
were independent, a reasonable test of the hypothesis that
the samples come from the same population would be provided
by examining the sum of their squares:

(3)  $\sum_{i=1}^{p} \frac{(\bar{x}_{1i} - \bar{x}_{2i})^2}{s_{1i}^2/n_1 + s_{2i}^2/n_2} .$

The quantity (3) would have approximately a chi-square
distribution of p degrees of freedom, central if the
hypothesis were true, non-central otherwise. From the point of
view of modern theory, the use of the statistic (3) can be
justified by, for example, the likelihood ratio principle.
And since the CRL is a linear function of (3) the use of the
CRL as a statistic for testing the hypothesis of homogeneity
still seems reasonable, provided that the various assumptions
mentioned above hold.
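
The computation just described can be sketched in Python. This is an illustration only: it assumes that (3) is the sum over traits of the squared mean difference divided by its estimated variance, and that the CRL is (3) divided by p, less 1 (Pearson's usual form). The divisor n in the standard deviation, the function names, and the toy data are assumptions made for the example, not taken from the report.

```python
from math import sqrt


def trait_stats(values):
    """Mean and standard deviation of one trait in one sample.

    The divisor n (rather than n - 1) is an assumption of this sketch.
    """
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return mean, sqrt(var)


def sum_of_squares_statistic(sample1, sample2):
    """Statistic (3): sum over traits of squared standardized mean differences.

    Each sample is a list of individuals; each individual is a list of
    p trait values. A trait with zero variance in both samples would
    cause a division by zero and is not handled here.
    """
    n1, n2 = len(sample1), len(sample2)
    p = len(sample1[0])
    total = 0.0
    for i in range(p):
        m1, s1 = trait_stats([ind[i] for ind in sample1])
        m2, s2 = trait_stats([ind[i] for ind in sample2])
        total += (m1 - m2) ** 2 / (s1 ** 2 / n1 + s2 ** 2 / n2)
    return total


def crl(sample1, sample2):
    """CRL taken as a linear function of (3): (3) / p - 1."""
    p = len(sample1[0])
    return sum_of_squares_statistic(sample1, sample2) / p - 1.0


# Toy data: two individuals per sample, two traits per individual.
sample_a = [[0.0, 0.0], [2.0, 2.0]]
sample_b = [[2.0, 3.0], [4.0, 5.0]]
stat = sum_of_squares_statistic(sample_a, sample_b)  # 4.0 + 9.0 = 13.0
coefficient = crl(sample_a, sample_b)                # 13.0 / 2 - 1 = 5.5
```

Under the hypothesis of homogeneity and independent traits, `stat` would be referred to a chi-square distribution with p degrees of freedom, as the text explains.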


Pearson did not suggest that the chi-square distribution
be used with the CRL, however. For most of the applications
in craniometry, p would be large enough so that the
chi-square distribution could be replaced by the normal with
negligible loss of accuracy. Pearson gave the first two
moments of the CRL (assuming the hypothesis true) and
suggested that these be used in referring a computed CRL to a
normal table. (It may be noted that the formula for the
second moment given in Miss Tildesley's paper is wrong by a
factor of 4. This mistake was repeated in a number of
subsequent papers, and only corrected in 1926.)

Pearson was well aware that the theoretical justification
for his coefficient rested on the assumption of the
independence of the traits measured. The correlation of
cranial traits had been the subject of much study by his
school. Miss Tildesley wrote, "... we do know quite enough
to assert that the correlation is never very high between
cranial characters which do not have any portion in common,
and which are not right and left measurements of homologous
characters. It is indeed often wholly negligible." As
Pearson pointed out (1926), it is easy in theory to allow
for dependence of the traits, but when p is as large as
20 the resulting computations are overwhelming. He
recommended that great attention be paid to the selection of
traits little correlated with each other within the sampled
populations.

From the point of view of the development of discriminatory
analysis it is of great interest to observe that from its
inception, the CRL was employed for two rather different
purposes. Properly speaking, the CRL is designed as a test
statistic, large values of which are supposed to reflect high
improbability that the two samples are drawn from the same
population. In applying the test, one selects a critical
value, say c, and rejects the hypothesis of homogeneity if
the CRL exceeds c. The value of c is selected according
to the level of significance which we desire our test to
have; by increasing the value of c we decrease the
probability of rejecting the hypothesis if it is true.


Now suppose $\pi_0$, $\pi_1$, and $\pi_2$ are three
populations from each of which we have a sample, say $S_0$,
$S_1$, and $S_2$ respectively. Suppose we compute the CRL
between $S_0$ and $S_1$ and find it to have the value $C_1$,
and correspondingly find the CRL between $S_0$ and $S_2$ to
have the value $C_2$. Suppose further that $C_1 > C_2$. We
could then select a critical value c which lies between the
two CRL's: $C_1 > c > C_2$. At the significance level
corresponding to c, we should accept the hypothesis that
$\pi_0$ and $\pi_2$ are identical and reject the hypothesis
that $\pi_0$ and $\pi_1$ are identical. An examination of
this situation makes it easy to understand why there is a
temptation to say, in such cases, that "$\pi_0$ is nearer to
$\pi_2$ than it is to $\pi_1$". If we succumb to this
temptation, we shall be using the CRL not as a test statistic,
but as a measure of some (as yet undefined) concept of
relative degree of resemblance or divergence in the totality
of populations under study.

It should be clear that the temptation to use a test
statistic as a metric is not confined to the CRL. If we have
any statistic for testing whether two samples are drawn from
the same population, the statistic being so constructed that
large values are indicative of difference in the populations
sampled, then it is rather natural to interpret larger
significant values as indicative of greater differences.

For example, Miss Tildesley computes the CRL between
French and English skulls (the value being 24.5), and also
between Egyptian and Negro skulls (the value being 27.3), and
then states "French and English are shown to be almost as far
apart racially as Egyptians and Negroes." Both values of CRL
are highly significant.

In the years following 1921, Pearson's school carried
out many craniometric researches in which the CRL was the
principal statistical tool. The chief contributor to this
work was G. M. Morant. Morant commented in 1923 on the
question of the use of the CRL as a measure of degree of
resemblance, in the following terms (Morant 1923, p. 205):

"the value of [(2)] computed from a number of mean
characters of two races is the Coefficient of Racial
Likeness between them and it is thus a measure of
the probability of the two being random samples
from the same population. It is not a true measure
of absolute divergence, and must not for a moment
be considered as such, but nevertheless we shall
speak of it, for convenience, as if it were an
absolute measure of racial affinity."

In spite of this warning, however, Morant and others continued
to use the CRL as a metric. The reason for this inconsistency
was doubtless the fact that the craniometrists had need for
such a metric, and the CRL was the only tool available to
them for such a purpose.

Morant was by no means an uncritical user of Pearson's
CRL. In 1924 he had this to say on the subject (Morant 1924,
p. 12):

"Given two random samples each of ten individuals
drawn from the same homogeneous population, the
Coefficient of Racial Likeness deduced from the mean
characters of the two samples will not differ
significantly from zero, and if two samples each of a
hundred individuals are drawn from the same
population then their Coefficient will also be of the
same order. But if two random samples each of ten
individuals are drawn from two different populations
and then two samples each of a hundred individuals
are drawn from the same differing populations it
will be found that the Coefficient between the first
pair will be very distinctly less than that between
the two samples of a hundred individuals each ... .
It is for this reason that Coefficients of Racial
Likeness may not be compared directly ..."


The reader may have been wondering what the CRL, whether
viewed as a test or as a measure, has to do with
discriminatory analysis. There is an obvious way in which a
measure of divergence can be used for discrimination purposes.
If we can measure the divergence of an individual (or a
sample) from each of several populations, to one of which it
is assumed that the individual (or sample) belongs, then it
seems reasonable to assign the individual (sample) to that
population from which the measured divergence is least. In a
somewhat similar way a test of significance of difference can
be used as a discriminator: we assign the individual to that
population from which it is significantly different at the
largest level of significance.

In 1926 Morant had occasion to deal with a discrimination
question in craniometry (Morant 1926b). An ancient
skull was discovered in 1888 in the commune of Chancelade in
France. It was examined by an anatomist, Dr. Testut, who
wrote, "parmi les races actuelles, celle qui me paraît
présenter la plus grande analogie avec l'homme de Chancelade
est celle des Esquimaux" [among present-day races, the one
that seems to me to show the greatest analogy with the
Chancelade man is that of the Eskimos]. Most anthropologists
agreed with Testut's conclusion, but some did not. In 1924
Sir Arthur Keith wrote, "...the Chancelade skull, while
possessing a few superficial resemblances to Eskimo skulls,
is in its essential character just as European as the people
of England and France today," (quotations from Morant's
paper). We have here a clear problem of discrimination and
Morant approached this problem biometrically.

Before seeing what Morant did, let us examine the CRL
more closely as a possible tool for discrimination. In
practice, the CRL is usually employed in a form somewhat
different from (2). Let $\sigma_{ai}$ denote the standard
deviation of the ith trait in the ath population. It is
usually assumed that $\sigma_{1i} = \sigma_{2i}$; in fact, in
craniometry it is customary to replace both $\sigma_{1i}$ and
$\sigma_{2i}$ by a value $\sigma_i$ obtained from a large
standard sample, it being felt that the variation in standard
deviation from one race to another is of less importance than
the sampling error of the usual small samples. With the
assumption $\sigma_{1i} = \sigma_{2i} = \sigma_i$, (2)
simplifies to

(4)  $\frac{1}{p} \sum_{i=1}^{p} \frac{n_1 n_2}{n_1 + n_2} \left( \frac{\bar{x}_{1i} - \bar{x}_{2i}}{\sigma_i} \right)^2 - 1 .$


Now if we wish to compare a single individual with each of
several different races, we would compute (4) between a first
sample, consisting of the single individual, and a second
sample, consisting in turn of each of the races. Thus $n_1$
would be 1, $\bar{x}_{1i}$ would be the value, $x_{1i}$, of
the ith trait for the individual, and (4) would become

(5)  $\frac{1}{p} \cdot \frac{n_2}{n_2 + 1} \sum_{i=1}^{p} \left( \frac{x_{1i} - \bar{x}_{2i}}{\sigma_i} \right)^2 - 1 .$

Finally, suppose that we have a large sample from the race;
then $\frac{n_2}{n_2 + 1}$ approximates to 1 and
$\bar{x}_{2i}$ tends in probability to the population mean
value, say $\xi_i$. Thus, the CRL simplifies to

(6)  $\frac{1}{p} \sum_{i=1}^{p} \left( \frac{x_{1i} - \xi_i}{\sigma_i} \right)^2 - 1 .$

We might then reasonably compute the value of (6), using the
mean values $\xi_i$ for each race in turn, and assign the
individual to that race for which (6) is smallest.
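
The assignment rule just stated can be sketched in Python. The sketch assumes that (6) has the form of the mean squared standardized deviation of the individual from the race means, less 1; the function names, the race labels "A" and "B", and the numbers are hypothetical and serve only as an illustration.

```python
def crl_single_individual(individual, means, sds):
    """Form (6): CRL between one individual and a race with known means.

    individual, means, and sds are lists of length p (one entry per trait).
    The exact form (mean squared standardized deviation, less 1) is an
    assumption of this sketch.
    """
    p = len(individual)
    total = sum(((x - m) / s) ** 2 for x, m, s in zip(individual, means, sds))
    return total / p - 1.0


def assign_to_race(individual, race_means, sds):
    """Return the label of the race for which (6) is smallest.

    race_means maps a (hypothetical) race label to its list of trait means;
    the same standard deviations are used for every race, as in the text.
    """
    return min(
        race_means,
        key=lambda name: crl_single_individual(individual, race_means[name], sds),
    )


# Hypothetical two-trait example.
skull = [10.0, 20.0]
sds = [2.0, 4.0]
race_means = {"A": [10.0, 24.0], "B": [14.0, 20.0]}

value_a = crl_single_individual(skull, race_means["A"], sds)  # (0 + 1) / 2 - 1 = -0.5
value_b = crl_single_individual(skull, race_means["B"], sds)  # (4 + 0) / 2 - 1 = 1.0
chosen = assign_to_race(skull, race_means, sds)               # "A"
```

Because the constant 1 and the factor 1/p are the same for every race, the rule amounts to choosing the race with the smallest sum of squared standardized deviations.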

Now let us consider what Morant actually did. To compare
the Chancelade skull with male Eskimo skulls, he obtained the
values of $\xi_i$ and $\sigma_i$ from large samples of modern
male Eskimo skulls, and computed, for p = 55 traits, the
values of the quantities

(7)  $\frac{x_{1i} - \xi_i}{\sigma_i} .$

If the Chancelade skull were Eskimo, we should have here
observed values of 55 (supposedly independent) normal
deviates, and might use these values to test the hypothesis
that the Chancelade skull is Eskimo. The corresponding test
might be made to determine whether the Chancelade skull
resembles, say, modern English skulls. Morant actually makes
two sets of such tests, by computing both the sample mean and
standard deviation of the quantities (7) and comparing them
with their "theoretical" values. Morant's conclusion was:
"...from the evidence afforded by the skull and mandible, we
may accept as a reasonable working hypothesis the statement
that the Chancelade individual was distinctly closer to the
Eskimo than to the modern English."

Since the standard deviation of the quantities (7) is
a function of the form (6) assumed by the CRL in this
situation, it turns out that one of the two tests made by
Morant amounts to the use of the CRL as a discriminator.
However, it is rather curious that the CRL is not explicitly
mentioned by Morant; in fact, this is about the only
craniometric work which Morant did in this period without
mentioning the CRL. It is an odd historical fact that the
connection of the CRL with discrimination did not come in the
direct way just discussed, but only in the roundabout fashion
outlined in the next chapters.

In 1926 Pearson published the first considerable
theoretical work on the CRL. In this paper, Pearson deals with
the independence assumption underlying his coefficient. In
fact, he suggests an alternative form of the coefficient,
which is suitable if the traits are not nearly independent,
and if there are only a few of them. Let $r_{ast}$ denote the
sample correlation between the sth and t-th traits in $S_a$.
Just as it is convenient to assume
$\sigma_{1i} = \sigma_{2i} = \sigma_i$, it is convenient to
assume $r_{1st} = r_{2st} = r_{st}$. Let

(8)  $y_i = \sqrt{\frac{n_1 n_2}{n_1 + n_2}} \cdot \frac{\bar{x}_{1i} - \bar{x}_{2i}}{\sigma_i} .$

Let R denote the correlation matrix of $y_1, y_2, \ldots, y_p$:

(9)  $R = \begin{pmatrix} 1 & r_{12} & \cdots & r_{1p} \\ r_{21} & 1 & \cdots & r_{2p} \\ \vdots & \vdots & & \vdots \\ r_{p1} & r_{p2} & \cdots & 1 \end{pmatrix}$

and let $R_{st}$ denote the cofactor of R at the sth row and
t-th column. Then it is known that

(10)  $\frac{1}{|R|} \sum_{s=1}^{p} \sum_{t=1}^{p} R_{st} \, y_s y_t$

will, if the samples are drawn from the same population and
the matrix R is exact, have a chi-square distribution with
p degrees of freedom. The quantity (10) may be considered
to be a generalization of the original CRL (4), to which it
reduces if the traits are independent.
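
A small numerical sketch of (10) in Python with NumPy follows. It rests on two assumptions: that the quantities in (8) are the mean differences standardized by $\sigma_i$ and scaled by $\sqrt{n_1 n_2/(n_1 + n_2)}$, and that the double sum of cofactors divided by $|R|$ may be evaluated as the quadratic form $y' R^{-1} y$ (cofactor over determinant being an element of the inverse, and R being symmetric). The function name and the numbers are illustrative only.

```python
import numpy as np


def generalized_statistic(mean1, mean2, sds, corr, n1, n2):
    """Quantity (10), evaluated as the quadratic form y' R^{-1} y.

    mean1, mean2: trait means of the two samples (length p)
    sds:          common standard deviations sigma_i (length p)
    corr:         p-by-p correlation matrix R, assumed known and nonsingular
    n1, n2:       sample sizes
    """
    scale = np.sqrt(n1 * n2 / (n1 + n2))
    diff = np.asarray(mean1, dtype=float) - np.asarray(mean2, dtype=float)
    y = scale * diff / np.asarray(sds, dtype=float)
    R = np.asarray(corr, dtype=float)
    # Solving R z = y avoids forming the inverse (or the cofactors) explicitly.
    return float(y @ np.linalg.solve(R, y))


# With uncorrelated traits the statistic is just the sum of squares of the y_i.
independent = generalized_statistic(
    [1.0, 2.0], [0.0, 0.0], [1.0, 1.0], [[1.0, 0.0], [0.0, 1.0]], 2, 2
)  # y = (1, 2), so 1 + 4 = 5.0

# A positive correlation of 0.5 between the traits changes the value.
correlated = generalized_statistic(
    [1.0, 2.0], [0.0, 0.0], [1.0, 1.0], [[1.0, 0.5], [0.5, 1.0]], 2, 2
)  # (1 - 2 + 4) / 0.75 = 4.0
```

For p as large as 20 this is a trivial computation today, whereas the hand evaluation of the cofactors was the great labor that Pearson had in mind.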

Pearson points out the great labor involved in computing
(10) when p is as large as, say, 20. He concludes that
"for the statistician, as for the statesman, the ideally best
is not always the wisest course."

In 1928 Morant returned to the difficulty he had pointed
out in 1924, that arises when one wishes to use the CRL as a
measure of dispersion in cases in which the sample sizes differ
widely. He suggested a corrective factor to be applied to
reduce the CRL to a standard sample size. Morant's criticism
and suggested correction are very similar to those offered at
about the same time by P. C. Mahalanobis, and we shall defer
discussion till the next section. Finally in 1928 Pearson
gave way before the arguments of Morant and Mahalanobis (K.
Pearson 1928b), and sanctioned a corrective factor which in
essence reduces the CRL to the $D^2$ statistic discussed in
the next section.

After 1928 numerous papers applying the CRL to
craniometric work continued to appear in Biometrika. Further
theoretical work shifted into other lines, however. The $D^2$
statistic, introduced originally as a modification of the
CRL, was studied extensively by the Indian school, with a
steady development of the relevant distribution theory
culminating in a paper by Bose and Roy in 1938. And in the
West, work of Fisher and Hotelling on different but related
problems prepared the way for the introduction of the linear
discriminant function in 1935. In an important paper of Fisher
in 1938, these various lines of development were brought
together. We shall trace the important features of these
researches in the next three chapters.

In the bibliography there is an extensive listing of
papers pertaining to the CRL. Among these are Batrawi and
Morant (1947), von Bonin (1931a, 1931b, 1936), von Bonin and
Morant (1938), Cleaver (1937), Collett (1933), Dingwall and
Young (1933), Goodman and Morant (1940), Harrower (1928),
Hasluck and Morant (1929), Hooke (1926), Hooke and Morant
(1926), Kitson (1931), Kitson and Morant (1933), Layard and
Young (1935), Little (1943), Martin (1936), Morant (1923,
1924, 1925, 1926a, 1926c, 1927a, 1927b, 1928a, 1928b, 1929a,
1929b, 1931, 1935, 1936a, 1936b, 1937, 1939a, 1939b), K.
Pearson (1926, 1928a, 1928b), Reid and Morant (1928), Risdon
(1939), Stoessiger (1927), Stoessiger and Morant (1932),
Tildesley (1921), Woo (1930), Woo and Morant (1932), and
Young (1931). The bulk of these papers contain only routine
applications of the CRL to craniology, and are devoid of
theoretical interest. Of greater interest are certain papers
which approach the CRL in a critical spirit. We have already
mentioned some of the comments of Morant, and those of
Mahalanobis will be further discussed in the next chapter.
In this regard one may mention Pearl and Miner (1935),
Fisher (1936a), and Seltzer (1937).

Certain other writers proposed coefficients similar to
the CRL, independently of and sometimes earlier than Pearson.
Joyce (1912) credits to H. E. Soper a "differential index"
which resembles the CRL except that the terms are not squared;
this reduces the statistical efficiency. A still more
primitive coefficient is that of Aebly (1926), in which
differences are not compared with their variabilities, but are
summed directly.


CHAPTER  III 


The Generalized Distance.

In 1923-1925, P. C. Mahalanobis was engaged in an
anthropometric study of the Anglo-Indians of Calcutta, and of
their relations to other racial groups. He at first employed
the then recently devised CRL as a principal statistical tool,
but (as had been Morant) was disturbed by the influence of
sample size on the CRL when it was used as a measure of the
divergence of two populations. On what appear to have been
rather intuitive grounds, Mahalanobis decided to drop the
coefficient $\frac{n_1 n_2}{n_1 + n_2}$ and obtained in this
way a statistic

(1)  $D^2 = \frac{1}{p} \sum_{i=1}^{p} \left( \frac{\bar{x}_{1i} - \bar{x}_{2i}}{\sigma_i} \right)^2 .$

This statistic, called at first the "caste-distance" and later
the "generalized distance", was used by Mahalanobis in the
presidential address delivered to the Anthropological Section
of the Indian Science Congress in 1925 (Mahalanobis 1927),
which was published in 1929.

The contrast between the CRL and $D^2$ is made clear if we
consider what happens when the sample sizes $n_1$ and $n_2$
are increased. If there is in fact no difference between the
populations with regard to the means of the traits, the
distribution of CRL will remain unchanged: the increase of
$\frac{n_1 n_2}{n_1 + n_2}$ will serve precisely to
counterbalance the tendency of $\bar{x}_{1i} - \bar{x}_{2i}$
to approach 0. On the other hand, if there is an actual
difference in the population means, say $\delta_i \neq 0$,
then $(\bar{x}_{1i} - \bar{x}_{2i})^2$ will tend in
probability to the positive quantity $\delta_i^2$ as the
sample sizes are increased. Consequently the CRL will tend in
probability to $\infty$. It is thus, as Morant saw in 1924,
unreasonable to use the CRL as a measure of divergence unless
the sample sizes are always the same. This difficulty does
not arise in the case of $D^2$. If $\xi_{1i}$ and $\xi_{2i}$
denote the population means, then as the sample sizes are
increased, $D^2$ tends in probability to

$\Delta^2 = \frac{1}{p} \sum_{i=1}^{p} \left( \frac{\xi_{1i} - \xi_{2i}}{\sigma_i} \right)^2 .$

We may therefore view the sample quantity $D^2$ as a point
estimate of the corresponding population quantity $\Delta^2$,
and state that the estimate is consistent (i.e., tends in
probability to the quantity being estimated as the sample
sizes are increased).
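
The effect of sample size can be illustrated with a short Python sketch. It assumes that the CRL with common standard deviations equals $\frac{n_1 n_2}{n_1 + n_2} D^2 - 1$, where $D^2$ is the mean squared standardized difference of the trait means; to isolate the effect of the coefficient, the sample means are held fixed at hypothetical values rather than simulated. Function names and numbers are illustrative.

```python
def d_squared(mean1, mean2, sds):
    """D^2: mean over traits of squared standardized mean differences."""
    p = len(sds)
    return sum(((a - b) / s) ** 2 for a, b, s in zip(mean1, mean2, sds)) / p


def crl_common_sd(mean1, mean2, sds, n1, n2):
    """CRL with common standard deviations, written in terms of D^2.

    The identity CRL = n1 * n2 / (n1 + n2) * D^2 - 1 is an assumption of
    this sketch about the form of the coefficient.
    """
    return (n1 * n2 / (n1 + n2)) * d_squared(mean1, mean2, sds) - 1.0


# Hypothetical means of two populations that really differ.
mean1, mean2, sds = [1.0, 2.0], [0.0, 0.0], [1.0, 2.0]

distance = d_squared(mean1, mean2, sds)  # (1 + 1) / 2 = 1.0, whatever the sample sizes
crl_by_size = {n: crl_common_sd(mean1, mean2, sds, n, n) for n in (10, 100, 1000)}
# crl_by_size == {10: 4.0, 100: 49.0, 1000: 499.0}
```

With the mean differences held fixed, the CRL grows in proportion to the sample size while $D^2$ stays put, which is the sense in which only the latter can serve as a measure of divergence.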

Mahalanobis has stated (1949, p. 237) that he presented
the foregoing argument to Karl Pearson in 1927, and that
Pearson refused to admit its validity. In any case,
Mahalanobis began to use his $D^2$ statistic, and in 1928
Morant published a very similar argument, together with
numerical data showing the tendency of the CRL to increase
with the sample size. Morant suggested that CRL's based on
widely different sample sizes could be made comparable by
corrective factors. Pearson in the same year endorsed
Morant's suggestion, whose effect is in practice to make the
CRL very similar to $D^2$.

From 1930 to 1938 the Indian school devoted much effort
to developing the distribution theory of the $D^2$ statistic.
In reviewing the history of this research, it will be convenient
to introduce some terminology to describe the various
assumptions under which one may study the distribution of $D^2$
and related statistics.

The reader may have been disturbed by the way in which
Pearson and his followers employ for the standard deviations
$\sigma_i$ quantities obtained from extraneous sources, and ignore
the sampling variability of these estimates. Practically,
if the samples are large, the variability of sample estimates
of the variance will not make a major contribution to the
distribution of the CRL or of $D^2$. In a sense, the values of
the variances are of secondary importance to the values of
the mean differences. But as the theory of statistics develops
refinement, and its methods are applied to smaller
samples, it becomes desirable to take into account the sampling
fluctuation of $\sigma_i$. It was the great contribution of
Student (1908) to recognize that the ratio of the mean deviation
of a normal sample to the estimate of the standard
deviation based on the sample, did not have a normal distribution.
It seems reasonable to distinguish, therefore, between
the "classical" and "Studentized" versions of the
distribution theory problem. If it is considered that the
quantities $\sigma_i$ and $r_{st}$ which are employed represent true
population values, we shall say that the problem is being
treated in its "classical" form; while if account is taken
of the fact that these quantities are estimated from the
samples, we shall refer to the problem as "Studentized."

The problem may be further characterized by either making
or not making the assumption that the traits are independent.
As with the CRL, the $D^2$ statistic was at first considered
only in the case that the quantities $\bar{x}_{1i} - \bar{x}_{2i}$;
$i = 1, 2, \ldots, p$, are independently distributed. By 1935,
however, the obvious extension involving the addition of
correlational terms had been made: the corresponding extension
for the CRL was made by Pearson in 1926. The dependent
version of $D^2$, using the notation of (10), Chapter
II, is given by


P  P 

B  £hs 

s=l  t=l 


A third categorization of the distribution problem
follows by observing that the distribution of $D^2$ may be
sought either in case the populations sampled are the same
(which we shall refer to as the central case), or in case
the populations differ for at least one trait (which we shall
refer to as the noncentral case). In summary, we may seek
the distribution of $D^2$ (or of the CRL) for the classical or
Studentized, for the independent or dependent, and for the
central or noncentral, cases. There are thus in total eight
possible situations.

In  the  terminology  just  introduced,  we  may  say  that 
Pearson  in  1921  gave  an  approximate  distribution  for  the 
central,  classical,  independent  CRL,  and  that  in  1926  he 
gave  the  exact  distribution  for  the  central  and  classical 
CRL,  which  turned  out  to  be  the  same  (chi-square)  regardless 
of  independence. 

In the same terminology, P. C. Mahalanobis considered
the independent, classical case, both central and noncentral,
in 1930. This was the first considerable paper on the theory
of the $D^2$ statistic. By a method thought to be approximate
(series expansion), Mahalanobis obtained the first four moments
of $D^2$. From these, and from large scale sampling experiments,
he was able to state: "We conclude therefore that the distribution
of $D^2$ will conform generally to Type I of the Pearsonian
family, except in the case of two groups (or samples)
taken from the same population, when the distribution will
pass into the Type III curve."

In the mid-1930's, the distribution problem of $D^2$ was
attacked by R. C. Bose. In 1935 Mahalanobis had published
the dependent form of $D^2$ mentioned above, and Bose first
considered the classical $D^2$, in both the independent and
dependent cases, both centrally and noncentrally. He was able
to obtain the exact distribution, and hence the moments. It
was found that the results of Mahalanobis were exact, and were
correct without the independence assumption.

R. C. Bose continued to work on the problem, trying to
remove the assumption that the covariance matrix is known.
In 1936 Mahalanobis defined explicitly the "Studentized" form
of $D^2$, and reported that Bose had found the first four moments
of $D^2$ in the noncentral Studentized case. Finally, in January,
1938, Bose and S. N. Roy were able to report to the first session
of the Indian Statistical Congress that they had succeeded
in solving the complete problem: they had found the distribution
of $D^2$ in the Studentized case, whether central or noncentral,
whether independent or correlated.

The chairman of the meeting was R. A. Fisher, and at the
end of the paper of Bose and Roy, Fisher rose to point out
that he had given (however, in connection with a quite different
statistical problem) the distribution which they had obtained,
in a paper published in 1928. It was also pointed out
that Hotelling in 1931 had obtained, also in another connection,
the Bose-Roy distribution for the central Studentized case. It
is reported that Fisher remarked "that he, and Professors
Hotelling and Mahalanobis had been unwittingly treading the
same ground. He was glad to avail himself of the present
opportunity to clear up this point." In the same year (1938)
Fisher published in his journal a paper pointing out the close
connection between several independent lines of development.


The Indian school has continued to develop the theory of
$D^2$, usually without reference to parallel developments in the
West. Roy and Bose (1940) have modified the statistic to
permit the covariance estimates to be based on several samples
while the mean differences are based on two. Bhattacharyya
and Narayan (1941) have investigated the $D^2$ moments when the
population variances are unequal. A. Bhattacharyya (1946) has
extended the $D^2$ statistic to the measurement of divergence
between multinomial distributions. P. K. Bose (1947a, 1947b,
1949) has developed recursion formulae with the aid of which
he has tabled percentage points of the central and noncentral
$D^2$ distribution, in both the classical and Studentized cases.
Bose is apparently unaware of the relation of his distributions
to the chi-square and F distributions, and as a result
seems in some cases to have duplicated existing tables.

The $D^2$ statistic has recently been used as a major tool
in a very extensive anthropological investigation (Mahalanobis,
Majumdar, and Rao), which comprises Parts 2 and 3 of Volume 9
of Sankhyā. The paper has several appendices in which various
theoretical points are discussed.


CHAPTER  IV 


Beginnings  of  Multivariate  Analysis. 

The coefficient of racial likeness and the generalized
distance share a feature which serves to distinguish them from
much of the preceding work in statistics. Both of these techniques
represent attempts to deal with inference problems in
which the data consists of several correlated (normal) measurements,
say $X_1, X_2, \ldots, X_p$, made on each individual or
experiment considered. These statistics are therefore precursors
of the theory of multivariate (normal) analysis, a prominent
example of which is the linear discriminant function. Before
discussing the linear discriminant function it will be useful
to describe briefly some developments of multivariate analysis,
most of which occurred between 1928 and 1938.

Beginning with the publication of Student's revolutionary
paper in 1908, the English school of statisticians have
devoted much effort to obtaining analytical expressions for
the distributions of commonly used statistics based on normal
samples. Previously, in 1900, Karl Pearson had obtained the
chi-square distribution as an approximate distribution for a
test of goodness of fit. In Student's 1908 paper, the chi-square
distribution was offered as the distribution of the
sample variance of a normal sample, and a distribution
equivalent to what is now known as the Student t-distribution
was given for the ratio of sample mean to sample standard
deviation. The next major advance occurred in 1915, when R. A.
Fisher, in finding the distribution of the correlation coefficient
computed from a bivariate normal sample, introduced
his method of geometrical argument in Euclidean hyperspace.
However much this method may fall short of present-day
requirements for rigor, in the hands of Fisher it was to produce
in the next fifteen years revolutionary results. In
1921 Fisher applied his geometrical argument to find the
distribution of the intraclass correlation coefficient. The
distribution was labelled by Fisher with the letter z, a
symbol now famous in statistics. It subsequently developed
that the z-distribution had applications far more important
than those to the intraclass correlation coefficient; in
fact, it turned out to be the general distribution needed to
establish the level of significance of all analysis of variance
tests. In a transformed version it is now widely known
as the F-distribution, having been so named by Snedecor in
Fisher's honor. In a series of papers from 1921, Fisher and
others gradually extended the statistical usefulness of the
F-distribution. Kolodziejczyk (1936) reduced its use to the
canonical form of tests of linear hypotheses.

In 1928 Fisher published another paper which is basic
for the development of discriminatory analysis. Again employing
the geometrical approach, he obtained the formula
for the distribution of the multiple correlation coefficient
for normal variables. Although this was the immediate purpose
of his work, it is rather two other results, given more
or less as corollaries, which concern us. As a limiting form
of the multiple correlation coefficient distribution, Fisher
obtained a distribution, which he labelled (B); and as a variant
form he obtained a third distribution labelled (C). The
(B) distribution is today known as the noncentral chi-square
distribution, and Fisher recognized that it "may be interpreted
as the distribution of the sum of squares of n variates
normally distributed with equal variance, but not with
zero means." The distribution (C) is what is now known as
the noncentral F-distribution, whose main present day use is
in determining the power of analysis of variance tests. Needless
to say, Fisher did not put his (C) distribution to such
a use in 1928, but he did discuss one example (the distribution
of a correlation ratio) which serves as precursor to
the modern use.

It is interesting that the necessary analytic work had
been done by 1928 for finding all of the eight distributions
mentioned in connection with $D^2$ in the preceding chapter. In
spite of this it took ten years for the statistical applications
of these distributions to $D^2$ to be realized; and when
they were, the realization came independently to two different
investigators.

In 1931 the central Studentized case was obtained by
Hotelling. Hotelling was interested in extending the work of
Student to normal vectors. Student's t-distribution made it
possible to test the hypothesis that the mean of a normal
population has a specified value, without assuming knowledge of
the value of the variance. Suppose that instead of a sample
from a univariate normal population we have a sample from a
multivariate normal population, and wish to test simultaneously
hypotheses specifying the values of the population means
of all components of the normal vectors involved. Hotelling
had previously studied problems of this kind while participating
in an investigation of the flow of particles in protoplasm
(Baas-Becking, et al., 1928). To deal with this testing
problem, Hotelling suggested (apparently on intuitive grounds)
a test statistic, termed by him $T^2$, and obtained its distribution.
The $T^2$ statistic is a direct generalization of the
Student t, and is, except for a constant multiplier, identical
with the correlated form of the CRL, given by Pearson in
1926, where the variances and correlations are not assumed
population values, but values estimated from the sample. The
distribution of $T^2$ obtained by Hotelling is simply the central
F-distribution first found by Fisher in 1921. Hotelling's
great contribution was to show that Fisher's distribution was
the appropriate one for a large class of testing problems,
including one of interest to us. $T^2$ may be used to test the
hypothesis that two multivariate samples have been drawn from
the same normal population, assuming that the samples come
from normal populations having the same covariance matrix.

We proceed to describe this test in some detail. Using
the notation employed in Chapter II, let $x_{\alpha i k}$
denote the value of the ith trait measured on the kth individual
from the $\alpha$th sample, where $\alpha = 1, 2$;
$k = 1, \ldots, n_\alpha$; and $i = 1, \ldots, p$. Let $\bar{x}_{1i}$
and $\bar{x}_{2i}$ be the arithmetic means of the values of the
ith trait in the first and second samples, respectively, and define

$$d_i = \frac{\bar{x}_{1i} - \bar{x}_{2i}}{\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} , \qquad n = n_1 + n_2 - 2 ,$$

$$n\, a_{ij} = \sum_{k=1}^{n_1} (x_{1ik} - \bar{x}_{1i})(x_{1jk} - \bar{x}_{1j}) + \sum_{k=1}^{n_2} (x_{2ik} - \bar{x}_{2i})(x_{2jk} - \bar{x}_{2j}) .$$

Let $a^{ij}$ denote the elements of the inverse of the matrix with
elements $a_{ij}$. (It is this matrix inversion which begins to
present great practical difficulties if p is very large.)
Hotelling's statistic is then

$$T^2 = \sum_{i=1}^{p} \sum_{j=1}^{p} a^{ij}\, d_i\, d_j .$$

The hypothesis is rejected when $T^2$ is too large, the critical
value for rejection being of course set according to the level
of significance desired. Hotelling proved that
$\frac{n + 1 - p}{n p}\, T^2$ has the distribution of F with p and
n + 1 - p degrees of freedom. The critical values may therefore
be taken from the widely available tables of percentage points
of the F-distribution.
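
The computation just described can be sketched in a few lines.
This is a minimal illustration, not Hotelling's own procedure: the
function names are hypothetical, the data in the usage lines are
invented, and instead of forming the inverse matrix explicitly the
sketch solves the linear system A w = d by Gaussian elimination,
so that T^2 = d'w. The critical value must still be read from
F tables.

```python
from math import sqrt


def mean_vector(sample):
    """Arithmetic mean of each trait over the rows of a sample."""
    n = len(sample)
    return [sum(row[i] for row in sample) / n for i in range(len(sample[0]))]


def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    p = len(rhs)
    aug = [list(matrix[i]) + [rhs[i]] for i in range(p)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, p):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, p + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * p
    for r in range(p - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, p))
        x[r] = (aug[r][p] - tail) / aug[r][r]
    return x


def hotelling_t2(sample1, sample2):
    """Return (T^2, F statistic, (df1, df2)) for two multivariate samples."""
    n1, n2 = len(sample1), len(sample2)
    p = len(sample1[0])
    n = n1 + n2 - 2
    m1, m2 = mean_vector(sample1), mean_vector(sample2)
    scale = sqrt(1.0 / n1 + 1.0 / n2)
    d = [(m1[i] - m2[i]) / scale for i in range(p)]
    # Pooled covariance estimates a_ij with divisor n = n1 + n2 - 2.
    a = [[0.0] * p for _ in range(p)]
    for sample, m in ((sample1, m1), (sample2, m2)):
        for row in sample:
            for i in range(p):
                for j in range(p):
                    a[i][j] += (row[i] - m[i]) * (row[j] - m[j]) / n
    w = solve(a, d)  # w = A^{-1} d, so T^2 = d' A^{-1} d
    t2 = sum(d[i] * w[i] for i in range(p))
    f_stat = (n + 1 - p) / (n * p) * t2
    return t2, f_stat, (p, n + 1 - p)


# Invented two-trait data, for illustration only.
group1 = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
group2 = [[2.0, 2.0], [3.0, 2.0], [2.0, 3.0], [3.0, 3.0]]
print(hotelling_t2(group1, group2))
```

With p = 1 the statistic reduces to the square of the ordinary
two-sample Student t, which gives a convenient check on any
implementation.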

The test just discussed is pertinent to the discrimination
problem, since there is no point in worrying about which of
two populations an individual comes from unless the two
populations are distinguishable. In the applications of the linear
discriminant function (Chapters V and VI) it is customary first
to employ the $T^2$ test to establish the difference of the
populations involved.

The choice by Hotelling of the $T^2$ statistic seems to have
been based on intuition. It is interesting that this particular
statistic may be obtained by applying a general principle,
and that it has certain optimum properties. Neyman and
Pearson (1928) proposed the likelihood ratio criterion for
obtaining statistical tests, and applied this criterion in 1930
and 1931 to obtain tests of the hypotheses that two or more
univariate normal samples arose from the same population.

Wilks (1932) obtained tests for a number of multivariate normal
hypotheses by application of the likelihood ratio principle.
In particular, Wilks found the likelihood ratio statistic for
testing the hypothesis that k p-variate samples came from
the same population, assuming that the samples arose from normal
populations having the same (unknown) covariance matrix.
When k = 2, Wilks' result reduces to Hotelling's.

Wilks also found the likelihood ratio test for the hypothesis
that several normal populations have the same covariance
matrix. Since the assumption of equal covariance
matrices underlies the linear discriminant functions, the latter
test is sometimes used as a preliminary to discrimination.
Bartlett (1937) proposed a modification of the constant factors
of the Wilks criterion, and other modifications and applications
of these procedures have been considered extensively, for
example by Lawley (1938, 1939), Bishop (1939), and Bishop and
Nair (1939). Exact tables are available only for the case
p = 1 (Thompson and Merrington, 1943).

Work on $T^2$ is still continuing. Hsu (1938) investigated
the noncentral $T^2$-distribution, that is, the distribution
of the $T^2$ test statistic in case the sampled populations
are in fact different. He found that the noncentral $T^2$-distribution
coincides with the noncentral F-distribution
investigated by Tang (1938) and with the (C) distribution of
Fisher (1928). Because of the identity of $T^2$ and $D^2$, Hsu's
result is equivalent to that of Bose and Roy (1938) discussed
in Chapter III. In 1941, Simaika, following the lead of
Hsu (1941), demonstrated that $T^2$ has the greatest power of any
test whose power depends only on the distance ($\Delta^2$) between
the populations. Further optimum properties of the $T^2$ test
are known. Wolfowitz (1949) showed that the test is the
most stringent similar test, and Hunt and Stein showed the
$T^2$ test to be most stringent and the uniformly most powerful
invariant test (see Lehmann, 1950). Further work has been
done by Hotelling (1947) and Hsu (1945).

Two other papers by Hotelling (1933 and 1936) are also
related to the problem of discrimination. In the earlier
paper, he considered the problem of finding that rotation in
p-dimensional space such that in the new coordinate system
the coordinates would be independently distributed. These
new coordinate directions were termed by Hotelling the "principal
components" of the given multivariate normal distribution.
It is interesting that the essential idea of Hotelling's work
was anticipated by Pearson (1901). Girshick (1936) showed
the equivalence of Hotelling's results with those obtainable
from the maximum likelihood principle. This topic is discussed
in greater detail in Part II of the present monograph.

In 1936 Hotelling considered the relations which may exist
between two correlated sets of random variables. He showed
how it was possible to rotate the sample space so that in
the new coordinates, the variables of each set are independent
among themselves, while between the sets there is dependence
only between certain corresponding pairs of variates. These
variates are called the "canonical variates" and the correlations
between them, the "canonical correlations." This work
is related to the linear discriminant function, since the
latter may be viewed as a canonical variate. Waugh (1942)
has illustrated the application of canonical variates to
economic data.


CHAPTER  V 


The  Linear  Discriminant  Function. 

The first clear statement of the problem of discrimination,
and the first proposed solution to that problem, were
given by R. A. Fisher in the middle of the 1930's. As was
the case with Karl Pearson's CRL, the ideas of Fisher first
appeared in print in papers by other people [Barnard (1935),
Martin (1936)], but it will be convenient to begin with a
discussion of Fisher's own first work on the subject. This
was contained in his paper, "The use of multiple measurements
in taxonomic problems," which appeared in Annals of Eugenics
in 1936.

In this paper, Fisher develops his theory largely by
means of working out a numerical example, and he is not always
careful to state precisely the assumptions which underlie
his conclusions. In the exposition of his work which follows,
it has been necessary at various points to infer what is meant.

The general situation studied by Fisher is as follows.
There are, say, two populations, $\pi_1$ and $\pi_2$. From each
population we have available a sample, say $n_1$ items from
$\pi_1$ and $n_2$ items from $\pi_2$. There is then presented a new
item, say I, which may have come from either $\pi_1$ or $\pi_2$.
The decision problem is to assign I to one of the two
populations. The available information consists of measurements
of, say, p quantities, $X_1, X_2, \ldots, X_p$, which are
made on I and on each of the $n_1 + n_2$ sample items.

We may conveniently approach Fisher's solution by considering
first the special univariate case, p = 1. We then
have two univariate samples, whose values may be represented
by numbers $x_{11}, x_{12}, \ldots, x_{1 n_1}$ for the first sample, by
$x_{21}, x_{22}, \ldots, x_{2 n_2}$ for the second sample, and by x for the
individual I. It seems reasonable to assign I to that
group which it more nearly resembles as indicated by the
measurements. We might, for example, compute the arithmetic
means, say $\bar{x}_1$ and $\bar{x}_2$, of the two samples, and then see
to which of these means x is closer. This is in fact the
procedure which Fisher proposes. (It may be noted that Fisher's
rule implies that the two possible errors of classification are
treated symmetrically. This matter is discussed at length in
Chapters VIII and IX.)

Fisher deals with the multivariate problem by reducing
it to the univariate problem just stated. This is accomplished
by replacing, for each individual, the p measurements by a
single measurement, say Y. There are of course many different
ways in which p quantities may be combined to produce a
single quantity, but Fisher considers only linear combinations,

$$Y = \lambda_1 X_1 + \lambda_2 X_2 + \cdots + \lambda_p X_p .$$

We may here use any set of coefficients
$(\lambda_1, \lambda_2, \ldots, \lambda_p)$, and the major accomplishment
of Fisher is to give a reasonable solution to the problem of
choosing the coefficients in the most advantageous way.

Let us denote the measured value of the jth quantity on
the kth individual in the ith sample by $x_{ijk}$; $i = 1, 2$;
$j = 1, 2, \ldots, p$; $k = 1, 2, \ldots, n_i$; and denote the measured
value of the jth quantity on I by $x_j$. Correspondingly, let

$$y_{ik} = \lambda_1 x_{i1k} + \lambda_2 x_{i2k} + \cdots + \lambda_p x_{ipk} \tag{1}$$

$$y = \lambda_1 x_1 + \cdots + \lambda_p x_p . \tag{2}$$


The appropriateness of the choice of values for
$\lambda_1, \lambda_2, \ldots, \lambda_p$ may be measured by the relative
ease of classifying I through use of the numbers y and $y_{ik}$.
If the two y samples are widely spaced and each is tightly
clustered about its own mean:


x  xxxx  x  x 


o  ooooo  o  o 


it  will  in  general  be  easier  to  make  a  correct  decision  about 
I  than  if  the  y  samples  overlap: 


x  xoxo  x  ox  o  o 


Fisher introduces a numerical measure of the ease of distinguishing
between the two populations. This is the ratio:

$$\frac{\text{difference of sample means}}{\text{standard error within samples}}$$

He then is able to suggest a reasonable criterion for determining
appropriate values of $\lambda_1, \lambda_2, \ldots, \lambda_p$: "what linear
function of the ... measurements ... will maximize the ratio
of the difference between the [sample] means to the standard
deviations within [samples]?"

Mathematically, the problem is to maximize the ratio

$$\frac{\bar{y}_2 - \bar{y}_1}{\sqrt{\sum_k (y_{1k} - \bar{y}_1)^2 + \sum_k (y_{2k} - \bar{y}_2)^2}} \tag{3}$$

where $\bar{y}_i = \frac{1}{n_i} \sum_k y_{ik}$, and $\sum_k$ denotes summation
over the sample. We do not need to divide the denominator by the
constant $n_1 + n_2 - 2$, since constant factors do not affect
the maximization problem, and we may equally consider the
square of (3), since this is more convenient mathematically
and since the non-negative quantity will be maximized when its
square is maximized.
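
As a rough numerical illustration of the ratio just described (the
difference of the two sample means of y over the square root of the
pooled within-sample sum of squares, left undivided by
n1 + n2 - 2), the sketch below evaluates it for a given set of
coefficients. The function names and the data are hypothetical.

```python
from math import sqrt


def linear_scores(sample, coeffs):
    """Replace each row of measurements by the single value y = sum(lambda_j * x_j)."""
    return [sum(c * x for c, x in zip(coeffs, row)) for row in sample]


def separation_ratio(sample1, sample2, coeffs):
    """Difference of y means over the root of the pooled within-sample sum of squares."""
    y1 = linear_scores(sample1, coeffs)
    y2 = linear_scores(sample2, coeffs)
    m1 = sum(y1) / len(y1)
    m2 = sum(y2) / len(y2)
    within = sum((y - m1) ** 2 for y in y1) + sum((y - m2) ** 2 for y in y2)
    return (m2 - m1) / sqrt(within)


# Invented two-trait samples, for illustration only.
sample1 = [[1.0, 2.0], [2.0, 3.0], [3.0, 5.0]]
sample2 = [[2.0, 1.0], [3.0, 2.0], [4.0, 2.5]]

print(separation_ratio(sample1, sample2, [1.0, 0.0]))   # first trait alone
print(separation_ratio(sample1, sample2, [1.0, -1.0]))  # a linear combination
print(separation_ratio(sample1, sample2, [3.0, -3.0]))  # same direction, rescaled
```

In this made-up example the combination separates the samples far
better than the first trait alone, and multiplying every coefficient
by the same positive constant leaves the ratio unchanged.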

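A little computation shows

$$\bar{y}_2 - \bar{y}_1 = \sum_{j=1}^{p} \lambda_j d_j ,$$

where $d_j = \frac{1}{n_2} \sum_k x_{2jk} - \frac{1}{n_1} \sum_k x_{1jk}$ is the
difference in sample means for the jth quantity, and

$$\sum_k (y_{1k} - \bar{y}_1)^2 + \sum_k (y_{2k} - \bar{y}_2)^2 = \sum_{j=1}^{p} \sum_{m=1}^{p} \lambda_j \lambda_m S_{jm} ,$$

where $S_{jm}$ is the pooled sum of products of deviations from
the sample means of traits j and m:

$$S_{jm} = \sum_{i=1}^{2} \sum_k (x_{ijk} - \bar{x}_{ij})(x_{imk} - \bar{x}_{im}) .$$

Here the quantities $d_j$ and $S_{jm}$ are computed from the
sample measurements.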

Our problem then is to determine the values of the $\lambda_j$
for which

$$\frac{\left( \sum_{j=1}^{p} \lambda_j d_j \right)^2}{\sum_{j=1}^{p} \sum_{m=1}^{p} \lambda_j \lambda_m S_{jm}} \tag{4}$$

is maximized. Since (4) is not altered when all of the $\lambda$'s
are multiplied by the same quantity, there will be many equally
good solutions, differing only by a constant factor. Ordinary
methods of the calculus give the solution. If we differentiate
(4) with respect to $\lambda_r$ and set the derivative
equal to 0, $r = 1, 2, \ldots, p$, we obtain the equations

$$\sum_{m=1}^{p} \lambda_m S_{rm} = \frac{\sum_{j=1}^{p} \sum_{m=1}^{p} \lambda_j \lambda_m S_{jm}}{\sum_{j=1}^{p} \lambda_j d_j} \, d_r , \qquad r = 1, 2, \ldots, p . \tag{5}$$

Since we are only interested in solution up to proportionality,
we may ignore the factor
$\frac{\sum_j \sum_m \lambda_j \lambda_m S_{jm}}{\sum_j \lambda_j d_j}$, which
is the same for all equations, and obtain as a solution the
roots of the equations

$$\begin{aligned}
S_{11} \lambda_1 + S_{12} \lambda_2 + \cdots + S_{1p} \lambda_p &= d_1 \\
S_{21} \lambda_1 + S_{22} \lambda_2 + \cdots + S_{2p} \lambda_p &= d_2 \\
&\;\;\vdots \\
S_{p1} \lambda_1 + S_{p2} \lambda_2 + \cdots + S_{pp} \lambda_p &= d_p .
\end{aligned}$$

We have thus in practice to solve a set of p simultaneous
linear equations; and as was pointed out by Karl Pearson in
1926 this places a practical limitation on the value of p.
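
The differentiation step can be spelled out as follows. The
abbreviations N and Q are introduced here only for brevity and are
not the author's notation; the step uses the symmetry
$S_{jm} = S_{mj}$ and assumes N is not zero.

```latex
\begin{aligned}
N &= \sum_{j=1}^{p} \lambda_j d_j , \qquad
Q = \sum_{j=1}^{p} \sum_{m=1}^{p} \lambda_j \lambda_m S_{jm} , \\[4pt]
\frac{\partial}{\partial \lambda_r} \left( \frac{N^2}{Q} \right)
  &= \frac{2 N d_r \, Q \;-\; 2 N^2 \sum_{m=1}^{p} \lambda_m S_{rm}}{Q^2} = 0 \\[4pt]
\Longrightarrow \quad
\sum_{m=1}^{p} \lambda_m S_{rm} &= \frac{Q}{N} \, d_r ,
  \qquad r = 1, 2, \ldots, p .
\end{aligned}
```

Because Q/N is a single number common to all p equations, dropping
it merely rescales every coefficient by the same factor, which is
why the simplified linear system yields an equally good solution.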

Having obtained the appropriate $\lambda$'s, we can now compute
the corresponding quantities $\bar{y}_1$, $\bar{y}_2$, and y, according
to (1) and (2). The problem becomes a univariate one,
and we can, for example, classify I into $\pi_1$ if and only
if y is closer to $\bar{y}_1$ than to $\bar{y}_2$.
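
The whole procedure, from the samples to the assignment of a new
item, can be sketched as follows. This is a minimal illustration
under stated assumptions: the function names and the data are
hypothetical, and the linear equations are solved by plain Gaussian
elimination, which is adequate only for small p.

```python
def column_means(sample):
    """Arithmetic mean of each measured quantity over a sample."""
    n = len(sample)
    return [sum(row[j] for row in sample) / n for j in range(len(sample[0]))]


def pooled_products(sample1, sample2):
    """Pooled sums of products S_jm of deviations from each sample's own means."""
    p = len(sample1[0])
    s = [[0.0] * p for _ in range(p)]
    for sample in (sample1, sample2):
        m = column_means(sample)
        for row in sample:
            for j in range(p):
                for k in range(p):
                    s[j][k] += (row[j] - m[j]) * (row[k] - m[k])
    return s


def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    p = len(rhs)
    aug = [list(matrix[i]) + [rhs[i]] for i in range(p)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, p):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, p + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * p
    for r in range(p - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, p))
        x[r] = (aug[r][p] - tail) / aug[r][r]
    return x


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def fit_ldf(sample1, sample2):
    """Return the coefficients and the two sample means of y."""
    m1, m2 = column_means(sample1), column_means(sample2)
    d = [b - a for a, b in zip(m1, m2)]  # d_j = second mean minus first mean
    coeffs = solve(pooled_products(sample1, sample2), d)
    return coeffs, dot(coeffs, m1), dot(coeffs, m2)


def classify(item, coeffs, y1_bar, y2_bar):
    """Assign to population 1 if and only if y is closer to y1_bar than to y2_bar."""
    y = dot(coeffs, item)
    return 1 if abs(y - y1_bar) < abs(y - y2_bar) else 2


# Invented two-trait samples, for illustration only.
sample1 = [[1.0, 2.0], [2.0, 3.0], [3.0, 5.0]]
sample2 = [[2.0, 1.0], [3.0, 2.0], [4.0, 2.5]]
coeffs, y1_bar, y2_bar = fit_ldf(sample1, sample2)
print(classify([1.5, 3.0], coeffs, y1_bar, y2_bar))
print(classify([3.5, 2.0], coeffs, y1_bar, y2_bar))
```

The rule treats the two kinds of misclassification symmetrically,
since the cut-off is simply the midpoint of the two y means.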

It will be appropriate to give a numerical illustration,
and for historical reasons it seems desirable to use the
illustration employed by Fisher. Fisher considers the problem
of distinguishing between species of Iris plant on the basis
of four measurements made on each plant: sepal length, sepal
width, petal length and petal width. He has samples of 50
from each of 2 species, I. setosa and I. versicolor (a third
species, I. virginica, is included in making genetical
applications). Fisher's example is unfortunate, in that a single
one of these characteristics will serve to do all of the
discriminating that anyone would ever need. Thus, the 50 Iris
setosa plants have petal lengths running from 1.0 to 1.9 cm
while the 50 Iris versicolor plants have petal lengths from
3.0 to 5.1 cm. Clearly no refined statistical technique is
needed to distinguish between such populations!

An excellent illustration of the linear discriminant
function may however be obtained if we ignore the figures on
petal length and width, and pretend that only the figures on
sepal length and width are available. Figure 1 shows the
two samples, Iris setosa and versicolor, plotted for the sepal
measurements. An inspection of this diagram shows just wherein
the value of the linear discriminant function lies. If
we considered sepal length and sepal width separately (see
Figure 1) it would be quite difficult to make an accurate
discrimination because of the large degree of overlap of the two
samples. But if we compute the linear discriminant function,
the discrimination becomes very good.

The figures involved are the following, letting $\pi_1$ be
Iris setosa and $\pi_2$ be Iris versicolor:

$$d_1 = 0.930 , \qquad d_2 = -0.658 ,$$

$$S_{11} = 19.1434 , \qquad S_{22} = 11.8658 , \qquad S_{21} = S_{12} = 9.0356 .$$

We have then to solve two linear equations in two unknowns:

$$\begin{aligned}
19.1434 \, \lambda_1 + 9.0356 \, \lambda_2 &= 0.930 \\
9.0356 \, \lambda_1 + 11.8658 \, \lambda_2 &= -0.658 .
\end{aligned}$$

The roots are easily found to be:

$$\lambda_1 = +0.1167 , \qquad \lambda_2 = -0.1443 . \tag{6}$$

Any pair of numbers proportional to these would serve as
well.
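
The arithmetic can be checked directly. The sketch below uses only
the figures quoted above; the variable and function names are
hypothetical, and the two equations are solved by Cramer's rule
rather than by whatever method Fisher used.

```python
# Figures quoted in the text for sepal length (trait 1) and sepal width (trait 2).
d1, d2 = 0.930, -0.658
s11, s12, s22 = 19.1434, 9.0356, 11.8658

# Cramer's rule for the 2 x 2 system S * lambda = d.
det = s11 * s22 - s12 * s12
lam1 = (d1 * s22 - s12 * d2) / det
lam2 = (s11 * d2 - s12 * d1) / det


def ratio_squared(l1, l2):
    """Squared separation ratio: (sum of lambda_j d_j)^2 over the quadratic form in S."""
    numerator = (l1 * d1 + l2 * d2) ** 2
    denominator = l1 * l1 * s11 + 2.0 * l1 * l2 * s12 + l2 * l2 * s22
    return numerator / denominator


print(round(lam1, 4), round(lam2, 4))  # 0.1167 -0.1443
print(ratio_squared(1.0, 0.0))    # sepal length alone
print(ratio_squared(0.0, 1.0))    # sepal width alone
print(ratio_squared(lam1, lam2))  # the linear discriminant function
```

The squared ratio for the fitted coefficients is several times that
of either sepal measurement taken singly, which is the numerical
counterpart of the improvement visible in the scatter diagram.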


A simple geometrical interpretation may be given to the
LDF. On Figure 1 is drawn the line a whose slope is
$\lambda_2 / \lambda_1 = -0.8486$. If we use, not the coefficients (6)
but the proportional coefficients


then the LDF

$$y = \lambda_1 X_1 + \lambda_2 X_2$$

amounts to projecting the points $(x_1, x_2)$ onto the line a.
The line a is so directed that projecting the samples onto
it provides the maximum possible separation of the samples.

We may note in passing that in this particular example,
excellent discrimination could be obtained by using the ratio
of sepal width to sepal length; this amounts to a projection
through the origin onto a vertical line. In other situations,
however, the ratio would be a worthless discriminator. The
great virtue of the LDF is that it always projects the samples
in the direction which gives the greatest possible separation.
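
The claim of greatest possible separation can be illustrated numerically. The sketch below assumes the usual form of Fisher's criterion: the squared difference between the projected sample means, divided by the projected within-sample sum of squares. The function name separation and the scan over directions are illustrative choices made here, not part of the report.

```python
import math

# Sepal figures quoted in Chapter V: within-sample sums of squares and
# products S, and differences of sample means d (length, width).
S = [[19.1434, 9.0356], [9.0356, 11.8658]]
d = [0.930, -0.658]


def separation(lam):
    """Assumed criterion: (lam'd)^2 / (lam' S lam) for a direction lam."""
    between = (lam[0] * d[0] + lam[1] * d[1]) ** 2
    within = sum(lam[i] * S[i][j] * lam[j] for i in range(2) for j in range(2))
    return between / within


# LDF coefficients: the solution of S lam = d, by Cramer's rule.
det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
ldf = ((d[0] * S[1][1] - S[0][1] * d[1]) / det,
       (S[0][0] * d[1] - S[1][0] * d[0]) / det)

print("sepal length alone:", round(separation((1.0, 0.0)), 4))
print("sepal width alone: ", round(separation((0.0, 1.0)), 4))
print("LDF direction:     ", round(separation(ldf), 4))

# Scan directions over a half circle; none should exceed the LDF direction.
best_scanned = max(
    separation((math.cos(t), math.sin(t)))
    for t in (math.pi * k / 3600 for k in range(3600))
)
print("best scanned direction:", round(best_scanned, 4))
```

Under this assumed criterion, either sepal measurement taken alone separates the samples much less well than the linear combination does, which is the point of the illustration in Figure 1.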

It  is  interesting  to  note  that  a  trait  which  of  itself 
provides  little  or  no  discrimination,  may  still  be  worth 
measuring  in  that  it  enhances  the  discriminatory  power  of 
other  traits.  An  exposition  of  this  situation  has  been  given 
by Cochran and Bliss (1948).

In his paper, Fisher makes several interesting comments
on the relation of the LDF to other statistical techniques.
On the one hand, the LDF corresponds to an analysis of
variance, with $(\bar{y}_2 - \bar{y}_1)^2$ corresponding to variance between
species and $\sum_k (y_{1k} - \bar{y}_1)^2 + \sum_k (y_{2k} - \bar{y}_2)^2$ corresponding
to variance within species. On the other hand, the LDF
can be considered as the solution of a regression problem.
This is done by giving to each population a different value
of an artificial variable, say z, and then regressing z
on the measurements $x_1, \ldots, x_p$. Through these considerations,
Fisher is led to suggest a test of the hypothesis that the
two populations are in fact identical. This test is identical
with the $T^2$ test proposed by Hotelling in 1931, which has been
discussed in Chapter IV.


In  conclusion,  it  should  be  pointed  out  that  Fisher  makes 
no  attempt  to  justify  on  probabilistic  grounds  his  definition 
of  optimum  separation,  nor  his  restriction  to  linear  combi¬ 
nations  of  the  measurements.  We  shall  see  later,  in  Chapter 
VIII  that  when  the  two  populations  are  normal  and  have  the 
same  covariance  matrix,  then  the  LDF  has  certain  optimum 
properties.  Otherwise  it  is  not  optimum. 


CHAPTER  VI 


Application  of  the  Linear  Discriminant  Function. 

Since 1935 the LDF has been applied to an amazing variety
of problems. To indicate the diversity of the published
applications, we present here in tabular form some thirty-two
papers. In each case we give the nature of the groups being
discriminated, and the nature of the observed quantities on
the basis of which the discrimination is effected. We have
purposely omitted from the list papers in which previously
published data is reanalyzed, such as Bartlett (1947), Brown
(1947), Fisher (1938b, 1940), Garrett (1943), Park and Day
(1942), and Penrose (1947).

Not  all  of  the  applications  in  this  list  are  of  the 
simple  type  described  in  Chapter  V;  some  involve  modifications 
and  extensions  of  the  LDF  such  as  are  discussed  in  Chapter 
VII. However, in general the applications follow a set
pattern: the nature of the groups and observations is
described, the LDF is computed, and the significance of the
discrimination  is  tested.  Sometimes  there  is  an  enquiry  into 
the  accordance  of  the  data  with  the  assumptions  which  under¬ 
lie  the  LDF,  or  an  appreciation  of  the  relative  discriminatory 
value  of  the  different  variables  measured. 

It  may  be  noted  that  the  thirty-two  papers  listed  appeared 
in  twenty-one  different  periodicals,  most  of  which  were  not 
specifically statistical in nature.

Authors  Date  Groups Discriminated  Nature of Observed Quantities


CHAPTER VII


Some  Modifications  and  Extensions  of  the  Discriminant  Function. 


During  the  fifteen  years  in  which  the  LDF  has  been  in  use, 
a  number  of  papers  have  been  published  which  are  concerned  with 
modifications  of  the  LDF,  designed  to  simplify  its  application, 
or  with  extensions  of  the  LDF  to  problems  somewhat  different 
from  the  classification  problem  which  led  Fisher  to  its  in¬ 
vention.  Some  of  these  results  are  briefly  described  in  the 
present  chapter. 

If p, the number of traits measured, is small (say 2,
3, or 4), then there is no special difficulty in solving the
linear equations which determine the LDF, even if no computing
machine is available. But if p is even as large as 6
or 8, the labor involved begins to be practically prohibitive,
and with p greater than 10 few persons will care to
tackle the problem aided only by a desk calculator. The


labor involved in computing the coefficients $S_{ij}$ increases
as $p^2$, and the labor involved in solving the equations
increases about as $p^3$.


For this reason there has been a good deal of effort
expended in seeking out simple and reasonably satisfactory
approximate solutions. There is of course a large general
literature on the solution of linear equations, which we shall
not consider here. We do however wish to discuss some work
aimed  specifically  at  the  equations  arising  in  discriminatory 
analysis. 

The first suggested approximation seems to have been that
given by Karl Pearson (1926). He pointed out that if the traits
are all independent, then we may replace the system of p linear
equations in p unknowns by p equations each involving
a single unknown:

$$S_{ii}\,\lambda_i = d_i, \qquad i = 1, 2, \ldots, p. \tag{1}$$

These equations present no difficulty even if p runs into
hundreds. Of course, the Pearson method is only reasonable
if in fact the correlations between traits are not too large.
Pearson suggested that the traits to be measured might be
chosen with this in mind.
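
The Pearson approximation is simple enough to try on the sepal figures of Chapter V. The sketch below is illustrative only: the criterion called separation is the assumed Fisher-type ratio of squared projected mean difference to projected within-sample sum of squares, which need not coincide with the discriminant ratio that Beall reports, and the variable names are hypothetical.

```python
# Sepal figures quoted in Chapter V (length, width).
S = [[19.1434, 9.0356], [9.0356, 11.8658]]
d = [0.930, -0.658]
p = len(d)


def separation(lam):
    """Assumed criterion: (lam'd)^2 / (lam' S lam)."""
    between = sum(lam[i] * d[i] for i in range(p)) ** 2
    within = sum(lam[i] * S[i][j] * lam[j] for i in range(p) for j in range(p))
    return between / within


# Pearson's approximation: one equation per trait, S_ii * lam_i = d_i,
# which ignores the off-diagonal products S_ij entirely.
pearson = [d[i] / S[i][i] for i in range(p)]

# Exact LDF coefficients for p = 2, by Cramer's rule.
det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
exact = [(d[0] * S[1][1] - S[0][1] * d[1]) / det,
         (S[0][0] * d[1] - S[1][0] * d[0]) / det]

# Within-sample correlation between the two traits.
r12 = S[0][1] / (S[0][0] * S[1][1]) ** 0.5

print("correlation:", round(r12, 3))
print("separation, Pearson approximation:", round(separation(pearson), 4))
print("separation, exact LDF:            ", round(separation(exact), 4))
```

The approximate coefficients can never give a larger value of this criterion than the exact ones; how much is lost depends on the correlations, as the examples quoted from Beall indicate.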

Beall (1945) has investigated the accuracy of approximation
(1) and of other approximations for three sets of data,
computing in each case the discriminant ratio obtained. In
one of his examples (data from Travers 1939) the correlations
are mostly small, ranging from -0.141 to +0.38, with 10
out of the 15 being between -0.1 and 0.1. In this case,
the simple equations (1) give a discriminant ratio of 1.27,
which may be compared with the ratio of 1.31 obtained by using
the correct LDF. But on another example (data from L. S.
Penrose), where the correlations run from 0.31 to 0.57, Beall
finds that (1) yields a discriminant ratio of 0.94, as compared
with the LDF ratio of 1.25.


These  results  suggest  that  if  most  of  the  correlations 
are  small  (say  between  -  0.2  and  0.2)  with  none  of  them 
very  large  (say  an  absolute  maximum  of  0.6),  then  the  simple 
solution  (1)  may  be  used  without  much  loss  in  discrimination. 

Another interesting approximate solution has been given
by Jackson (1943). He postulates that all of the correlations
have a common value, whose estimate is, say, r, and
correspondingly replaces the quantity $S_{ij}$ by the quantity
$r\sqrt{S_{ii} S_{jj}}$.

If we divide the $i$th of the linear equations for the
discriminant function coefficients by $\sqrt{S_{ii}}$, and let
$\beta_i = \lambda_i \sqrt{S_{ii}}$, $e_i = d_i / \sqrt{S_{ii}}$, we obtain the system

$$\begin{aligned}
\beta_1 + r\beta_2 + r\beta_3 + \cdots + r\beta_p &= e_1 \\
r\beta_1 + \beta_2 + r\beta_3 + \cdots + r\beta_p &= e_2 \\
r\beta_1 + r\beta_2 + \beta_3 + \cdots + r\beta_p &= e_3 \\
&\;\;\vdots
\end{aligned} \tag{2}$$

These equations may be readily solved. Summing them, we find

$$[1 + (p-1)r] \sum_{i=1}^{p} \beta_i = \sum_{i=1}^{p} e_i. \tag{3}$$

The $j$th of equations (2) may be written

$$(1-r)\,\beta_j + r \sum_{i=1}^{p} \beta_i = e_j. \tag{4}$$

Combining (3) and (4), we have

$$\beta_j = \frac{(1-r)\,e_j + p\,r\,(e_j - \bar{e})}{(1-r)\,[1 + (p-1)r]}, \tag{5}$$

where $p\,\bar{e} = \sum_{i=1}^{p} e_i$. Since we require a solution only up to
proportionality, we may use

$$\beta_j = (1-r)\,e_j + p\,r\,(e_j - \bar{e}). \tag{6}$$
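
The closed form for the equal-correlation system can be checked by substitution. The following Python sketch is illustrative only: the standardized mean differences e and the common correlation r are invented values, the symbol beta for the rescaled coefficients is a notational assumption, and the function names are hypothetical.

```python
def jackson_solution(e, r):
    """Closed-form solution of the equal-correlation system for given e and r."""
    p = len(e)
    e_bar = sum(e) / p
    denom = (1.0 - r) * (1.0 + (p - 1) * r)
    return [((1.0 - r) * ej + p * r * (ej - e_bar)) / denom for ej in e]


def left_hand_sides(beta, r):
    """Left sides of the system: beta_j + r * (sum of the other betas)."""
    total = sum(beta)
    return [(1.0 - r) * b + r * total for b in beta]


# Hypothetical standardized mean differences and common correlation.
e = [0.9, 0.4, -0.2, 0.6]
r = 0.3

beta = jackson_solution(e, r)
residuals = [lhs - ej for lhs, ej in zip(left_hand_sides(beta, r), e)]
print("beta:", [round(b, 4) for b in beta])
print("largest residual:", max(abs(x) for x in residuals))
```

Dropping the common denominator changes every coefficient by the same factor, which is why the simpler proportional form suffices in practice. The coefficients on the original scale are recovered by dividing each rescaled coefficient by the square root of the corresponding sum of squares.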


There still remains the problem of determining an average
value of r. Jackson and Beall suggest various estimates,
which are not very dissimilar. A reasonable one is Jackson's,
given by Beall as:

[The expression for r and the definitions that accompany it are illegible in this copy.]

The computations here are not heavy if a desk calculator is
available.


In all of the examples considered by Beall, the results
obtained in using (6) compare very favorably with those
obtained from the LDF. Where the LDF gives discrimination ratios
of [illegible] and 1.25, the Jackson method (6) gives 5.00,
1.30, and 1.24 respectively. It should be remarked, however,
that in all three examples the correlations are not widely
divergent.

In using the Jackson technique, one should, where possible,
give the scales of measurement a common orientation,
so that the correlations will at least tend to have the same
sign. Thus, if the measurements are all related to intelligence,
then a high score on all tests used should have the
same meaning: either high intelligence in all cases or low
intelligence in all cases. This result can be obtained by
appropriate choice of sign.

In conclusion, we may state that the problem of approximate
solutions of the LDF equations deserves further study,
both empirical and theoretical. Empirically, more studies
of the kind carried out by Beall would be of interest.
Theoretically, one might seek mathematical bounds for the loss
in discriminatory power which results from using various
approximations. Further approximations might also be studied,
an obvious one being a combination of those of Pearson and
Jackson.

Pending such studies, the experimenter may use the
following rules of thumb:

(1) If p is not too large, or if the importance of
the problem and accuracy of the data warrant the
extra work, the accurate LDF should be found
exactly.

(2) If the correlations are believed to be mostly
small, the equations (1) should be used.

(3) If the correlations are sizeable but not too
divergent in size, the equations (6) should
be used, after, however, so taking the orientation
of scales that the correlations are all
of the same sign.

Theoretically the LDF is designed to solve the problem of
assigning an individual to the proper one of two populations.
However, from the very beginning (Barnard 1935) the technique
has been employed with more than two populations. It is clear
that a single linear function will do a good job with more
than two populations only when these populations are collinear,
that is, when the changes in the means of the p traits,
from one population to another, are proportional. As is
customary in applied statistics, an assumption which underlies
a theoretical result need not be exactly satisfied for
the result to be usable. But if the populations are not at
least approximately collinear, useful information will be lost
if classification is carried out through use of the LDF. It
is possible to test the hypothesis of collinearity: tests have
been proposed by Fisher (1938), Bartlett (1947b), and Rao (1948b).
The two latter authors reexamine Barnard's (1935) data, and
find that the linearity assumption is not reasonable in that
case. A visual inspection of Barnard's data will lead to the
same conclusion. One might almost make it a postulate that
if the samples are large, a test of collinearity will lead
to rejection. This may still not preclude the reasonableness
of using the LDF, if the departure from linearity, while
significant, is not large. If it is large, one may employ
more than one discriminant function. This procedure is
discussed by Rao (1948c) and Brown (1947), as well as in the
papers just cited. For a different approach, see Day and
Sandomire (1942).

In the practical applications, after the LDF has been
found, it is natural to enquire whether some of the variables
contribute enough to the discrimination to warrant
their continuance in further studies. The problem is complicated
by the fact, mentioned in Chapter V, that the contribution
of a variable to the discrimination may be indirect.
The problem of omitting a variable from a discriminant function
is not essentially different from that of omitting a
variable from a multiple regression. Aside from empirical
discussions (such as that in Barnard, 1935 and other
applicational papers), various authors have proposed tests of the
additional discriminatory power contributed by a particular
trait or traits. For discussion of the numerical problems
involved in dropping a variate, see Cochran (1938) and
Quenouille (1949a, 1949b).


The LDF has proved to be a valuable tool in fields of
application other than that for which it was originally
intended. There is a tendency in the literature to term any
linear combination of measurements, in which the coefficients
are adjusted to achieve some optimum effect, a "discriminant
function," even though the effect sought is not the specific
discrimination of groups. This extension is not of course
directly pertinent to the problem of classification, and
will be dealt with briefly.

An early example of such an extended use of the LDF is
provided by H. F. Smith (1937), who found that linear function
of several observed characteristics of wheat which correlated
most highly with a compound of the corresponding qualities
representing economic value. Further examples of extension
of the LDF arise when one seeks to assign scores to
qualitative characters in such a way as to maximize some effect.
Examples of this process may be found in Fisher (1925,
194[?], pp. 289-295), Fisher (1940), Maung (1941), and Johnson
(1950).

The extended LDF has even been used to effect a general
attack on problems of multivariate analysis (Rao, 1948b).
Recall that in Chapter V we introduced the LDF as that
(linear) reduction of a multivariate problem to the corresponding
univariate problem, which would effect the best
separation of the univariate samples. More generally, in
performing multivariate tests of significance, we may seek
that linear reduction of the data which makes greatest the
apparent significance being tested. The tests obtained in
this way cannot in general be dealt with through solving
systems of linear equations, but the test statistics obtained
are functions of the roots of a determinantal equation of
the form $|A - \lambda B| = 0$, where A and B are p x p sample
covariance matrices. The sampling theory of these roots and
of the test statistics which depend on them is very complicated
and will not be dealt with here. In general, the
distributions involved have not been tabled, but large-sample
approximations are available. References to some of
the literature are given in the bibliography. See Anderson
(1946, 1948), Anderson and Girshick (1944), Anderson and
Rubin (1949), Bartlett (1938, 1941, 1947a), Fisher (1938b,
1939, 1940), Girshick (1939), Hsu (1939, 1940, 1941a, 1941b,
1941c, 1941d), Rao (1946, 1948b), and Roy (1939c, 1940a,
1940b, 1942a, 1942b, 1945, 1946a, 1946b). In his lectures
at Columbia, Anderson (1947) has given a thorough treatment
from the likelihood ratio viewpoint. General surveys have
also been made by Bartlett (1947b) and by Tukey.


CHAPTER VIII


Classification  from  the  Point  of  View  of  Probability  of  Error. 

The distinguishing feature of the modern theory of statistical
inference is the focusing of attention on the probabilistic
behavior of statistical procedures. The approach of
the linear discriminant function to the classification problem
is essentially intuitive rather than probabilistic: we ask,
what linear combination of the measurements best separates
the samples? The philosophy underlying the LDF is very similar
to that which motivated the development of the analysis
of variance by Fisher in the 1920's.

The development of a theory of statistical tests, as
distinct from a collection of special examples, may be said
to have begun with the introduction of the notion of types of
error by Neyman and Pearson in 1928 and 1933. Correspondingly,
the initiation of a theoretical attack on the classification
problem may be said to have begun when the Neyman-Pearson
ideas were adapted to the discriminant function by
Welch in 1939. Welch's results were published in a brief
note, but the ideas involved are of sufficient importance to
warrant a rather full discussion.

Welch considers only the problem of classifying an individual
into one of two populations, say $\pi_1$ and $\pi_2$, and
further restricts the problem by assuming that the probability
density function of the measured quantities is completely
known within each of the populations. Let
$f_1(x_1, x_2, \ldots, x_p)$ denote the probability density of the observable
quantities $X_1, X_2, \ldots, X_p$ in $\pi_1$, and let
$f_2(x_1, x_2, \ldots, x_p)$ be the corresponding density in $\pi_2$.

Welch observes that any method of classifying an individual
I into one of the two populations on the basis of
observations on $X_1, X_2, \ldots, X_p$ amounts to a partition of
the p-dimensional "sample space" of the X's into two exhaustive
and mutually exclusive regions, say $R_1$ and $R_2$,
with the rule that I will be assigned to $\pi_1$ if the random
point with coordinates $(X_1, X_2, \ldots, X_p)$ falls into $R_1$, and
will be assigned to $\pi_2$ if $(X_1, X_2, \ldots, X_p)$ falls into $R_2$.
The choice of a rule for classification or discrimination is
thus equivalent to the choice of a partitioning of the sample
space into the regions $R_1$ and $R_2$.

Welch  further  proposes  a  criterion  on  the  basis  of  which 
the  various  possible  partitions  may  be  compared  as  to  their 
desirability.  He  suggests  that  a  partition  (or  rule  for 
classifying  I)  be  judged  on  the  basis  of  the  probabilities  of 
misclassification which arise when the rule is employed.

Two forms of the problem are treated. First, Welch supposes
that there exist a priori probabilities that I comes
from the two populations, say probability $p_1$ that I does
in fact belong to $\pi_1$ and probability $p_2$ that I belongs to
$\pi_2$. Here of course $p_1 + p_2 = 1$. Using the method of Bayes,
we may compute the a posteriori probabilities that I belongs
to $\pi_1$ and to $\pi_2$. These values are

$$\frac{p_1 f_1(x_1, x_2, \ldots, x_p)}{p_1 f_1(x_1, x_2, \ldots, x_p) + p_2 f_2(x_1, x_2, \ldots, x_p)}
\quad\text{and}\quad
\frac{p_2 f_2(x_1, x_2, \ldots, x_p)}{p_1 f_1(x_1, x_2, \ldots, x_p) + p_2 f_2(x_1, x_2, \ldots, x_p)}$$

respectively. We may then assign I to that population whose
a posteriori probability is greatest. This procedure coincides
with that which is obtained if we compute the likelihood
ratio

$$\lambda = \frac{f_1(x_1, x_2, \ldots, x_p)}{f_2(x_1, x_2, \ldots, x_p)}$$

and assign I to $\pi_1$ if $\lambda > p_2 / p_1$, otherwise assigning I to
$\pi_2$. Welch asserts (as is easily shown) that these equivalent
rules lead to the minimum possible probability of misclassification.
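
The equivalence of the two rules can be seen in a small numerical case. The sketch below is illustrative only: it assumes a single measurement, two normal densities with invented means and a common standard deviation, and invented a priori probabilities; none of these values come from Welch's note, and the function names are hypothetical.

```python
import math


def normal_density(mean, sd):
    """Return the density function of a normal distribution."""
    def f(x):
        z = (x - mean) / sd
        return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))
    return f


# Hypothetical densities in the two populations and a priori probabilities.
f1 = normal_density(0.0, 1.0)
f2 = normal_density(2.0, 1.0)
p1, p2 = 0.3, 0.7


def posterior_probabilities(x):
    """A posteriori probabilities of the two populations, by Bayes' rule."""
    w1, w2 = p1 * f1(x), p2 * f2(x)
    return w1 / (w1 + w2), w2 / (w1 + w2)


def assign_by_posterior(x):
    post1, post2 = posterior_probabilities(x)
    return 1 if post1 > post2 else 2


def assign_by_likelihood_ratio(x):
    return 1 if f1(x) / f2(x) > p2 / p1 else 2


# Compare the two rules over a grid of possible observations.
grid = [-3.0 + 0.25 * i for i in range(33)]
agreement = all(
    assign_by_posterior(x) == assign_by_likelihood_ratio(x) for x in grid
)
print("rules agree on the whole grid:", agreement)
```

Because the larger a priori probability is given to the second population, the dividing point lies nearer the mean of the first population than the midpoint between the two means.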

The solution obtained by Welch under the assumption of
the existence of a priori probabilities had an historically
interesting precursor. In 1898, Heincke was led, in his study
of the races and varieties of herring in the North Sea, to
attempt a probabilistic solution of the species problem.

Heincke noticed that whereas each of several observable traits
of the herring would provide some information as to the variety,
none of these traits considered alone would enable him to make
a sufficiently accurate classification. He thus sought a
method which would enable him to combine the information
obtained from several observed traits. The distinguishing
features of his work were, first, that the variables he considered
were primarily discrete instead of continuous, and
secondly, that he made the assumption of equal a priori
probabilities. That is, if there were three possible varieties
from which a given herring might have come, Heincke assumed
that there was a 1/3 chance that the herring came from each.
Heincke's principle of classification, granting his assumption,
has a distinctly modern sound: "Das Individuum muss schliesslich
derjenigen Rasse zugezählt werden, für die das Produkt
der Wahrscheinlichkeiten aller Eigenschaften ein Maximum ist."
(The individual must finally be assigned to that race for
which the product of the probabilities of all its characteristics
is a maximum.)
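
Heincke's rule lends itself to a short illustration. The Python sketch below is hypothetical throughout: the trait names, the varieties, and every frequency are invented for the purpose and are not Heincke's data. Taking the plain product of the trait probabilities treats the traits as independent within a variety, and the equal a priori probabilities drop out of the comparison.

```python
from math import prod  # available in Python 3.8 and later

# Invented relative frequencies of two discrete traits within each of
# three varieties (each inner table sums to 1).
trait_frequencies = {
    "variety A": {"vertebrae": {55: 0.2, 56: 0.5, 57: 0.3},
                  "keeled scales": {13: 0.6, 14: 0.4}},
    "variety B": {"vertebrae": {55: 0.5, 56: 0.4, 57: 0.1},
                  "keeled scales": {13: 0.3, 14: 0.7}},
    "variety C": {"vertebrae": {55: 0.1, 56: 0.3, 57: 0.6},
                  "keeled scales": {13: 0.5, 14: 0.5}},
}


def product_of_probabilities(variety, observed):
    """Product over traits of the probability of the observed value."""
    table = trait_frequencies[variety]
    return prod(table[trait][value] for trait, value in observed.items())


def classify(observed):
    """Assign to the variety for which the product is a maximum."""
    return max(trait_frequencies,
               key=lambda v: product_of_probabilities(v, observed))


herring = {"vertebrae": 57, "keeled scales": 14}
assigned = classify(herring)
print("assigned to:", assigned)
```

With unequal a priori probabilities, each product would simply be multiplied by the probability of its variety before the maximum is taken.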

Heincke's assumption of equal a priori probabilities
corresponds  to  the  ancient  "principle  of  insufficient  reason." 
However,  from  the  frequency  interpretation  of  probability  here 
adopted,  this  assumption  would  be  reasonable  only  if,  say,  the 
herring  had  been  drawn  at  random  from  a  master  population  in 
which  the  three  varieties  were  mixed  in  equal  proportions.  In 
general, the validity of the assumption of a priori probabilities
seems to be restricted in applications. An interesting
example  in  which  there  existed  known  a  priori  probabilities 
was  considered  by  Martin  (1936).  Here,  skulls  and  jawbones 
were  recovered  from  a  large  grave,  but  in  the  recovery  pro¬ 
cess  the  jawbones  became  disassociated  from  the  skulls.  In 
the  sexing  of  the  material,  it  is  considerably  easier  to  attach 


the  correct  sex  to  a  skull  than  to  a  jawbone.  Thus  (con¬ 
siderably  simplifying  the  problem  for  purposes  of  illustration) 
we  might  say  that  we  know  the  proportion  of  male  and  female 
jawbones,  and  can  use  these  proportions  as  known  a  priori  proba¬ 
bilities.  The  example  is  exceptional,  however,  and  on  the 
whole  a  solution  of  the  problem  which  does  not  involve  the 
assumption of known a priori probabilities is more frequently
needed. We may remark that it is easy to show that a
formulation  in  which  there  are  assumed  to  exist  a  priori  proba¬ 
bilities  which  are  however  unknown,  does  not  essentially  differ 
from  a  formulation  in  which  no  a  priori  probabilities  are  as¬ 
sumed  to  exist.  (In  the  language  of  Wald’s  theory,  this  amounts 
to  saying  that  the  class  of  Bayes  solutions  is  complete.  This 
point  is  discussed  further  in  Chapter  IX.) 

Heincke's work was the stimulus of a line of research on
the European continent which seems to have been rather independent
of the researches which are the main subject of this
paper. Of this European work we may mention that of Zarapkin
(1934), Kozminski (1936), and Cavalli (1945). Zarapkin modified
the Heincke method, and Cavalli considered the relative
merits of the methods of Heincke and Zarapkin. These researches
do not seem to have contributed much to the main
stream of discriminatory analysis.

The biological problem of species has, naturally, been
the  stimulus  of  a  great  deal  of  work  on  the  classification 
problem.  We  have  already  seen  that  Karl  Pearson  began  with 
the  problem  of  human  racial  classification,  and  Fisher’s 


first  paper  on  the  discriminant  function  was  concerned  with 
taxonomy. Heincke, Kozminski, Zarapkin, and Cavalli were
similarly motivated. In this connection there is a wealth
of material on mathematical definition of species, mostly
not of a probabilistic nature. See, for instance, Joyce
(1912), where an idea of Soper's is used; Williams (1929);
and Ginsburg (1938), who uses the notion of probability of
misclassification to define degrees of biological dissimilarity.

Welch  also  considers  a  second  form  of  the  problem,  in 
which  no  a  priori  probabilities  are  postulated.  Here  there 
are  two  probabilities  of  misclassification  to  be  considered: 

(1) The probability of classifying I into $\pi_2$
when in fact I belongs to $\pi_1$,

(2) The probability of classifying I into $\pi_1$
when in fact I belongs to $\pi_2$.

Let us denote these probabilities by $P(R_2 | \pi_1)$ and $P(R_1 | \pi_2)$
respectively. Welch states (as again is easily shown) that
to minimize these two probabilities of misclassification, subject
to the condition that they are equal, one again employs
a partitioning of the likelihood ratio kind. The region $R_1$
consists of those points for which $\lambda > k$, the value of k
being chosen so that $P(R_2 | \pi_1) = P(R_1 | \pi_2)$.
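
The choice of k can be illustrated in the simplest case. The sketch below assumes a single measurement that is normal in both populations with a common standard deviation; the means, the standard deviation, and the function names are invented for illustration. In this case the likelihood ratio decreases as the measurement increases, so the first region consists of the values below some cut point c, and the cut point is found by bisection.

```python
import math


def normal_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Hypothetical means (mu1 < mu2) and common standard deviation.
mu1, mu2, sigma = 0.0, 2.0, 1.0


def likelihood_ratio(x):
    """Ratio f1(x) / f2(x) for the two normal densities."""
    return math.exp(-((x - mu1) ** 2 - (x - mu2) ** 2) / (2.0 * sigma ** 2))


def error_probabilities(c):
    """Errors when the first region is taken as the values x < c."""
    err_pi1 = 1.0 - normal_cdf((c - mu1) / sigma)  # assigned to 2, belongs to 1
    err_pi2 = normal_cdf((c - mu2) / sigma)        # assigned to 1, belongs to 2
    return err_pi1, err_pi2


# Bisection on c: the first error falls and the second rises as c increases.
lo, hi = mu1, mu2
for _ in range(60):
    mid = 0.5 * (lo + hi)
    err_pi1, err_pi2 = error_probabilities(mid)
    if err_pi1 > err_pi2:
        lo = mid
    else:
        hi = mid

c = 0.5 * (lo + hi)
k = likelihood_ratio(c)
err_pi1, err_pi2 = error_probabilities(c)
print("cut point:", round(c, 6), " k:", round(k, 6))
print("error probabilities:", round(err_pi1, 6), round(err_pi2, 6))
```

With equal variances the cut point falls midway between the two means and the corresponding k is 1. With densities of other forms the same search applies, but the first region need not be a single interval.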

The  reader  acquainted  with  the  Neyman-Pearson  theory  will 
recognize  the  foregoing  as  a  slight  modification  of  the  fact 
that  the  most  powerful  test  of  a  simple  hypothesis  against  a 
simple  alternative  is  that  based  on  the  likelihood  ratio 


principle (Neyman and Pearson, 1933a). The only novelty in
Welch’s  work  is  that  the  two  types  of  error  are  treated  sym¬ 
metrically,  whereas  in  the  usual  formulation  of  the  hypothesis 
testing  problem,  the  two  types  of  error  are  treated  differ¬ 
ently:  we  place  a  preassigned  limit  (called  the  level  of 

significance)  on  the  probability  of  one  error,  and  then  seek 
to minimize the probability of the other error. Symmetrization
of the problem does not alter its essential mathematical
nature.

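Welch concludes his brief note by considering an example.
He supposes that $X_1, X_2, \ldots, X_p$ have a joint normal distribution
with a known covariance matrix $\|\sigma_{ij}\|$ which is the same
in the two populations, the two populations thus differing
only in the (known) expectations. Formally,

$$f_k(x_1, x_2, \ldots, x_p) = \frac{1}{(2\pi)^{p/2}\,|\sigma_{ij}|^{1/2}}
\exp\Big\{-\tfrac{1}{2} \sum_{i=1}^{p} \sum_{j=1}^{p} \sigma^{ij} (x_i - \theta_{ik})(x_j - \theta_{jk})\Big\},
\qquad k = 1, 2,$$

where $\|\sigma^{ij}\|$ is the inverse of $\|\sigma_{ij}\|$. When we form
the likelihood ratio the constant factors cancel
and the exponential factors combine to give

$$\lambda = \exp\Big\{-\tfrac{1}{2} \sum_{i=1}^{p} \sum_{j=1}^{p} \sigma^{ij}
\big[(x_i - \theta_{i1})(x_j - \theta_{j1}) - (x_i - \theta_{i2})(x_j - \theta_{j2})\big]\Big\}.$$

Thus $\lambda$ is a monotone function of the double sum in the
exponent, which may be simplified. We obtain

$$-\log \lambda = \tfrac{1}{2} \sum_{i=1}^{p} \sum_{j=1}^{p} \sigma^{ij} (\theta_{i1}\theta_{j1} - \theta_{i2}\theta_{j2})
+ \sum_{i=1}^{p} \sum_{j=1}^{p} \sigma^{ij} (\theta_{i2} - \theta_{i1})\, x_j.$$

The first term on the right is independent of the sample points
so $\lambda$ is a monotone function of the second term. But the
latter is that to which Fisher's LDF simplifies if the population
expectations and covariance matrix are known. Thus
Welch's work puts a theoretical basis under the LDF, at least
in a special case.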
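
That the likelihood ratio depends on the observations only through a linear function of them can be checked numerically. The sketch below is illustrative only: it takes two measurements, an invented common covariance matrix, and invented expectations, and compares the negative logarithm of the likelihood ratio computed directly from the two quadratic forms with a constant plus a linear function. All names and values are assumptions made here.

```python
# Hypothetical common covariance matrix and expectations (two measurements).
cov = [[2.0, 0.6], [0.6, 1.0]]
theta1 = [0.0, 0.0]
theta2 = [1.5, -0.5]

# Inverse of the 2 x 2 covariance matrix.
det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
inv = [[cov[1][1] / det, -cov[0][1] / det],
       [-cov[1][0] / det, cov[0][0] / det]]


def bilinear(u, v):
    """Sum over i, j of inv[i][j] * u[i] * v[j]."""
    return sum(inv[i][j] * u[i] * v[j] for i in range(2) for j in range(2))


def neg_log_ratio_direct(x):
    """-log(f1/f2) from the two quadratic forms in the exponents."""
    u1 = [x[i] - theta1[i] for i in range(2)]
    u2 = [x[i] - theta2[i] for i in range(2)]
    return 0.5 * (bilinear(u1, u1) - bilinear(u2, u2))


# Constant term and linear weights after the quadratic terms in x cancel.
constant = 0.5 * (bilinear(theta1, theta1) - bilinear(theta2, theta2))
weights = [sum(inv[i][j] * (theta2[i] - theta1[i]) for i in range(2))
           for j in range(2)]


def neg_log_ratio_linear(x):
    return constant + sum(weights[j] * x[j] for j in range(2))


points = [[0.0, 0.0], [1.0, 2.0], [-3.0, 0.5], [2.5, -1.5]]
largest_gap = max(abs(neg_log_ratio_direct(x) - neg_log_ratio_linear(x))
                  for x in points)
print("largest discrepancy:", largest_gap)
```

The quadratic terms in the observations cancel only because the same inverse matrix appears in both exponents, which is why the assumption of a common covariance matrix matters.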

It is important to observe the essential nature of the
assumption that the two populations have the same covariance
matrix. Without this assumption, the likelihood ratio does
not simplify as much as before, and we find that $\lambda$ is a
monotone function of a quadratic function of the sample values.
We are then led to a quadratic rather than to a linear discriminant
function. Smith (1947) has introduced and employed
these quadratic discriminators. The theory of the quadratic
discriminant functions has not yet been extensively developed.
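
The appearance of the quadratic term is easily seen with a single measurement. In the sketch below, which is illustrative only, the two populations are normal with invented means and unequal invented standard deviations; minus twice the logarithm of the likelihood ratio is computed directly and compared with an explicit quadratic in the observation. The names and values are assumptions made here and are not taken from Smith's paper.

```python
import math

# Hypothetical means and unequal standard deviations.
mu1, sd1 = 0.0, 1.0
mu2, sd2 = 1.0, 2.0


def log_density(x, mu, sd):
    """Logarithm of the normal density."""
    return (-math.log(sd) - 0.5 * math.log(2.0 * math.pi)
            - 0.5 * ((x - mu) / sd) ** 2)


def minus_two_log_ratio(x):
    """-2 log(f1/f2), computed directly from the two densities."""
    return -2.0 * (log_density(x, mu1, sd1) - log_density(x, mu2, sd2))


# Coefficients of the quadratic a*x^2 + b*x + c obtained by expansion.
a = 1.0 / sd1 ** 2 - 1.0 / sd2 ** 2
b = -2.0 * mu1 / sd1 ** 2 + 2.0 * mu2 / sd2 ** 2
c = mu1 ** 2 / sd1 ** 2 - mu2 ** 2 / sd2 ** 2 + 2.0 * math.log(sd1 / sd2)


def quadratic(x):
    return a * x * x + b * x + c


largest_gap = max(abs(minus_two_log_ratio(x) - quadratic(x))
                  for x in (-2.0, 0.0, 0.5, 3.0))
print("coefficient of x squared:", a)
print("largest discrepancy:", largest_gap)
```

The coefficient of the squared term vanishes exactly when the two standard deviations are equal, and the discriminator is then linear.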

From the applicational point of view, Welch's results
are obtained under rather severe restrictions. Two of these
were removed in 1945 by von Mises. Von Mises considered the
problem of classifying the individual into one of several
populations, say $\pi_1, \pi_2, \ldots, \pi_k$, instead of only two; and
further, he was able to remove the rather undesirable restriction,
imposed ab initio by Welch, that the two probabilities of
misclassification should be equal. If there are k populations,
then the number of possible errors of classification
is k(k-1), since the individual may belong to any of the k
populations, and then may be misclassified into any of the
k-1 remaining populations. Thus, the two population problem
gives 1·2 = 2 errors, the three population problem gives
2·3 = 6 errors, etc. The problem thus becomes very rapidly
more complicated with increasing k. We can effect a considerable
simplification if we focus attention not on the
misclassifications but on the correct classifications, for
there are only k of the latter. In the case k = 2, we
get the same results whether we consider misclassifications
or correct classifications, since there are two of each and
their probabilities are complementary by pairs. In the
general problem, however, a real simplification is implied
by considering the correct classifications. This amounts to
treating all errors alike for a given true population, but
permitting the errors to be considered differently as the
true population is changed. We may extend our former notation,
letting the sample space be partitioned into k
regions $R_1, R_2, \ldots, R_k$, with the rule that I shall be assigned
to $\pi_i$ if and only if the sample point $x = (x_1, x_2, \ldots, x_p)$
falls into $R_i$. Again, $P(R_i | \pi_j)$ will denote the probability
that I will be assigned to $\pi_i$, given that I belongs to
$\pi_j$.

Von  Mises  formulated  the  problem  in  the  following  terms: 
what  classification  procedure  will  maximize  the  minimum  of  the 

probabilities $P(R_i \mid \pi_i)$ of correct classification? (It may
be noted that this formulation of the problem amounts to a
completely symmetric way of viewing the errors.) As did
Welch, von Mises considers that the random variables to be
observed have, within each of the k populations, known
density functions, which we may denote by $f_i(x_1, x_2, \ldots, x_p)$,
$i = 1, 2, \ldots, k$. Using the methods of the calculus of variations,
von Mises obtains the results in the following terms:
"The partition of the x-space that solves our problem is
characterized by two properties: (1) for all k regions $R_i$
the value of $P(R_i \mid \pi_i)$ is the same; (2) along the border
between $R_i$ and $R_j$ the ratio $f_i(x)/f_j(x)$ is constant." Thus,
Welch's assumption of equal probabilities of incorrect (and
hence of correct) classification comes out as a consequence
in von Mises' work, and again the optimum partition of the
sample  space  is  that  given  by  the  ratio  of  the  likelihoods. 

The  reader  who  is  acquainted  with  recent  developments  in 
the  theory  of  statistical  decision  functions  will  have  recog¬ 
nized  that  von  Mises'  formulation  of  the  problem  (i.e.,  the 
maximization  of  the  minimum  probability  of  correct  classifi¬ 
cation)  is  an  illustration  of  the  minimax  principle.  This 
principle,  which  seems  to  have  been  introduced  into  the  theory 
of statistics by Neyman and Pearson in 1933, has been the subject
of a great deal of modern development, primarily by Abraham
Wald. Chapter IX is devoted to the application of Wald's ideas
to  the  classification  problem. 

The  main  practical  disadvantage  of  the  work  of  Welch  and 
von  Mises  lies  in  the  assumption  made  by  these  writers  that 
the  parameters  of  the  normal  distributions  are  all  known.  The 
Welch test statistic

$$\sum_{i=1}^{p} \sum_{j=1}^{p} \sigma^{ij} (\theta_{j1} - \theta_{j2}) x_i$$

involves all of the population parameters. In the great
majority of applicational problems we do not know the values
of $\sigma^{ij}$, $\theta_{j1}$, and $\theta_{j2}$, but must rely on estimates of these
quantities obtained from samples.

The  problem  of  the  estimation  of  the  normal  parameters 
from  sample  values  arises  in  two  main  forms: 

(1) there are available samples of known origin
from $\pi_1$ and $\pi_2$,

(2)  the  samples  are  intermingled,  so  that  we  do 
not  know  for  any  individual  in  the  sample  the 
true  population  of  origin. 

The  second  form  of  the  problem  is  of  course  much  harder  than 
the first. An approach to its solution, in the case of univariate
normal samples, was made by Karl Pearson in 1894 by
means of his method of moments. This technique has not worked
well in practice (see Martin 1936) and is not theoretically
efficient.  Fisher’s  maximum  likelihood  method  provides  a 
theoretically  better  solution.  However,  the  fact  is  that  it 

is  extremely  difficult  to  decompose  a  mixture  of  two  normal 
populations  unless  the  populations  are  very  well  separated, 

so  that  the  sample  has  two  clear  modes. 

Rao (1948) has considered a problem of this kind. He
considers the observed frequency distribution of heights of
454 plants, supposed to be of two different types but botanically
indistinguishable. Assuming equal variances for the
two types, Rao estimates that the sample is drawn from a compound
population obtained by mixing in proportions 57% and
43% two normal populations whose means differ by about 1 1/2
standard deviations. He decides, "these estimates can be
safely used for interpreting differences in heights," and
checks his goodness of fit with a chi-square of 1.30 on 6
degrees  of  freedom. 

The  trickiness  of  problems  of  this  kind  is  made  clear 
by  the  following  observation,  due  to  Dr.  Fix:  if  we  fit  a 
single  normal  distribution  to  the  same  data,  we  may  obtain  a 
fit whose chi-square is 0.68 with 8 degrees of freedom! Thus
we  obtain  a  better  fit  with  the  simpler  model.  This  fact 
makes  one  doubt  that  there  is  much  safety  in  Rao’s  interpre¬ 
tation  of  height  differences,  and  points  out  that  there  is 
little  hope  of  reliable  results  in  resolving  mixtures  of 
normal  populations  unless  the  samples  are  extremely  large 
(in  which  case  departure  from  exact  normality  would  cause 
trouble)  or  unless  the  population  means  are  separated  by  a 
good deal more than 1 1/2 standard deviations.

Fortunately,  samples  of  known  origin  are  usually  avail¬ 
able  so  that  the  problem  of  estimating  the  population  parameters 
arises  in  its  simpler  form.  The  obvious  modification  of  the 

Welch test statistic

$$\sum_{i=1}^{p} \sum_{j=1}^{p} \sigma^{ij} (\theta_{j1} - \theta_{j2}) x_i$$

is to replace the unknown parameters by their estimates. This
is in fact what the LDF does; it corresponds to the extension
of the likelihood ratio principle to the composite hypothesis
case, in which one considers the ratio of maximized likelihoods.

In 1944, Wald considered the problem of finding the distribution
of a statistic obtained in the manner just suggested.
If $s^{ij}$ is the estimate for $\sigma^{ij}$, so that $s_{ij}$ is the usual
unbiased estimate for $\sigma_{ij}$ obtained from the pooled sample
data, if $\bar{x}_i$ and $\bar{y}_i$ represent the arithmetic means of the
sample measurements on the ith trait in the two samples respectively,
and if $(z_1, z_2, \ldots, z_p)$ represent the measurements
on the individual I to be classified, then Wald's
statistic is

$$U = \sum_{i=1}^{p} \sum_{j=1}^{p} s^{ij} z_i (\bar{y}_j - \bar{x}_j).$$

The relation of U to the LDF is clear. Wald gives the large
sample distribution of U (this being essentially the approach
of Fisher in 193$) and investigates the exact distribution of
U. His results are not simple, and are not in a form available
for applicational use. Further work on the distribution
needs to be done to make Wald's results more readily available
for applications. In this connection, see Harter (1930).
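
As a rough illustration of how a statistic of this kind is formed from samples of known origin, the following Python sketch computes the two sample means, the pooled unbiased estimates s_ij, their inverse s^{ij}, and then U = sum_i sum_j s^{ij} z_i (ybar_j - xbar_j) for one individual. The data, the function names, and the restriction to p = 2 traits (so that the matrix inverse can be written out by hand) are assumptions made for the example; they are not taken from Wald's paper.

```python
def mean_vector(sample):
    """Arithmetic mean of each trait over a list of observation tuples."""
    n = len(sample)
    p = len(sample[0])
    return [sum(obs[i] for obs in sample) / n for i in range(p)]


def pooled_covariance(sample_x, sample_y):
    """Usual unbiased pooled estimate s_ij from two samples of known origin."""
    xbar, ybar = mean_vector(sample_x), mean_vector(sample_y)
    p = len(xbar)
    dof = len(sample_x) + len(sample_y) - 2
    s = [[0.0] * p for _ in range(p)]
    for sample, centre in ((sample_x, xbar), (sample_y, ybar)):
        for obs in sample:
            for i in range(p):
                for j in range(p):
                    s[i][j] += (obs[i] - centre[i]) * (obs[j] - centre[j])
    return [[s[i][j] / dof for j in range(p)] for i in range(p)]


def inverse_2x2(s):
    """Inverse s^{ij} of a 2 x 2 matrix s_ij (p = 2 only, for brevity)."""
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    return [[s[1][1] / det, -s[0][1] / det],
            [-s[1][0] / det, s[0][0] / det]]


def wald_u(z, sample_x, sample_y):
    """U = sum_i sum_j s^{ij} z_i (ybar_j - xbar_j)."""
    xbar, ybar = mean_vector(sample_x), mean_vector(sample_y)
    s_inv = inverse_2x2(pooled_covariance(sample_x, sample_y))
    p = len(z)
    return sum(s_inv[i][j] * z[i] * (ybar[j] - xbar[j])
               for i in range(p) for j in range(p))


# Hypothetical measurements on two traits for samples from pi_1 and pi_2.
sample_1 = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
sample_2 = [(2.0, 2.0), (3.0, 2.0), (2.0, 3.0), (3.0, 3.0)]

u_near_1 = wald_u((0.5, 0.5), sample_1, sample_2)
u_near_2 = wald_u((2.5, 2.5), sample_1, sample_2)
```

In this made-up example an individual measured near the second sample receives the larger value of U. The sketch does not supply a cutoff, since choosing one requires the distribution of U, which is exactly the difficulty noted above.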

A lengthy paper on the classification problem was published
by Rao in 1948. The paper consists of three parts,
the second and third of which are concerned with the problem
of arranging a system of populations into a hierarchical order,
and are hence not directly pertinent to discriminatory analysis.
In the first part, Rao reobtains the 1945 results of
von Mises, and extends them in several ways. He develops
further his suggestion, introduced in 1947, that the classification
problem be modified to permit classification not to
be made in certain cases. Thus, the sample space is partitioned
into k + 1 parts, the usual classification regions
$R_1, R_2, \ldots, R_k$, and a remaining part $R_0$ with the rule
that if the sample point falls into $R_0$ no decision will be
reached. It is of course true that in many applicational
situations circumstances compel a decision to be reached; but
there are problems in which the contrary is true, and for
these cases the Rao method permits the construction of a
classification rule with preassigned limits on all of the
probabilities of misclassification.

Rao  extends  to  several  populations  the  Welch  solution  of 
the  classification  problem  with  known  a  priori  probabilities. 

He  adopts  the  idea  of  Heincke  that  if  nothing  is  known  about 
the  a  priori  probabilities,  they  may  be  assumed  to  exist  and 
all  to  be  equal.  Rao  gives  explicit  statements  of  the  like¬ 
lihood  principle  in  a  variety  of  special  cases. 

Another recent work of interest is the 1949 paper of
Hoel and Peterson. These authors presume the existence of
a priori probabilities, and first obtain the same extension
of Welch's work to k populations which was obtained by Rao.
They then suppose that the a priori probabilities, while still
existing, are not known but may be estimated from a sample.
There may also be unknown parameters in the densities
$f_i(x_1, x_2, \ldots, x_p)$. A set of estimators will be called optimum
if it maximizes the probability of correct classification.
The  authors  then  consider  conditions  under  which  the  maximum 
likelihood  estimates  will  be  asymptotically  optimum  in  this 
sense. 

The  Hoel-Peterson  paper  suggests  the  following  question, 
which  seems  to  be  interesting.  A  more  general  formulation  of 
the  definition  of  optimum  would  be  as  follows:  that  classifi¬ 
cation  procedure  is  optimum  which  maximizes  the  probability 
of correct classification. We may then ask, does this definition
coincide with that of Hoel and Peterson--that is, can
best  use  of  the  sample  information  be  made  by  first  estimat¬ 
ing  the  a  priori  probabilities  and  parameter  values,  and  then 
proceeding  to  classify  as  if  these  estimates  were  known  to  be 
correct?  An  answer  to  this  question  should  be  possible,  using 
the  methods  of  the  general  theory  of  statistical  decision 
functions. 

Problems which are essentially classificatory arise constantly
in the field of medical diagnosis: the physician
must  assign  the  patient  to  one  of  several  categories,  which 
may  be  taken  to  correspond  to  the  state  of  health  and  to  the 
various  diseases  under  consideration,  or  to  various  classifi¬ 
cations  of  severity  of  a  disease.  Not  much  work  seems  to 
have  been  done  toward  the  construction  of  a  probabilistic 
theory  for  diagnosis,  perhaps  through  reluctance  to  treat 
diagnosis as a chance phenomenon. A beginning was made recently
by Neyman (1947), who proposed a simple probabilistic
model which will account for observed variation in X-ray diagnosis
for tuberculosis. Chiang and Hodges (1948) have continued
this line of work. An interesting possibility is that
sequential  diagnostic  schemes  might  be  considered  proba¬ 
bilistically.  Sobel  has  initiated  an  attack  on  sequential 
solutions  of  the  classification  problem  in  his  doctoral  dis¬ 
sertation  at  Columbia  University. 

Recently  Birnbaum  and  Chapman  have  considered  a  problem 
which  is  essentially  discriminatory.  Suppose  we  wish  to  se¬ 
lect  individuals  who  have  a  high  value  of  a  quantity  Y 
which  is  not  directly  observable,  but  which  is  correlated  with 
observable quantities $X_1, X_2, \ldots, X_p$. Birnbaum and Chapman
show that if $X_1, X_2, \ldots, X_p, Y$ have a (p+1)-variate normal
distribution, selection by means of an appropriate linear
combination of the X's is optimum in various senses. For
example, such a "linear truncation" will maximize the conditional
expectation of Y among those selected, the frequency
of  selection  being  fixed. 

It  is  disturbing  to  the  theoretical  statistician  that  the 
classification  of  an  individual  into  a  category  may  be  preceded 
by  other  statistical  inferences,  often  carried  out  with  the 
same  data.  It  seems  clear  that  these  preliminary  inferences 
will  alter  in  a  serious  way  the  theoretical  performance  of  the 
discrimination  itself.  There  may  even  be  a  whole  chain  of 
consecutive  inferences.  To  illustrate,  suppose  that  a  sta¬ 
tistician  is  given  a  set  of  data  consisting  of  readings  on  a 
new serological test. He may first test the homogeneity of
the data--is there evidence that the data come from more than
one population? If he decides that more than one population
is present, he must then decide how many populations there
are. At the same time he tries to formulate a probabilistic
model for the observations, consisting of a form of probability
distribution for each population. These distributions
may  contain  parameters,  which  must  then  be  estimated.  And 
finally  the  sampled  individuals  may  be  classified.  If  it  is 
desirable  that  theory  correspond  to  reality,  then  there  is 
need  for  an  inclusive  theory  which  will  allow  for  these 
multi-stage  decision  procedures. 

A beginning has been made by the Hoel-Peterson paper discussed
earlier, where the estimation and classification stages
are analyzed together. In another interesting paper Paulson
(1949) considers the problem of grouping individuals into a
"superior" group and an "inferior" group, or else of deciding
that all of the individuals are "neutral." This amounts to
a  two-stage  procedure:  first  we  decide  whether  there  are  one 

or  two  populations  represented;  and  if  we  decide  there  are 
two  populations,  we  proceed  to  classify  the  individuals  into 
them.  Paulson  proposes  an  intuitively  reasonable  procedure 
and  considers  its  probabilistic  behavior  in  the  case  of  normal 
observations  of  known  variance.  His  work  opens  up  many  in¬ 
teresting  and  important  problems. 

CHAPTER IX

Risk  and  Minimax  Ideas. 

We have seen in Chapter VIII that von Mises (1945) defined
the optimum procedure for classification into one of
several populations as that procedure for which the minimum
probability of correct classification is maximized. This
formulation marks the introduction of minimax ideas into
discriminatory analysis. It is the purpose of the present
chapter to describe some recent work in this direction.

The risk and minimax notions seem to have been introduced
into statistical literature by Neyman and Pearson (1933^).
These authors were concerned with testing hypotheses, but as
we have seen, hypothesis testing is analogous to the
two-population classification problem, and the generalization to
k populations presents no difficulty. The specific extension
of the risk and minimax notions to the k-population
classification problem has been carried out by Rao (1947°,
1948c), Brown (194®, 1949), and Girshick (1949). We shall
here  present  the  notions  directly  in  the  extended  form. 

As  was  mentioned  earlier,  in  classifying  an  individual 
into one of k populations, there are k(k-1) distinct possible
errors of classification. The complexity of analysis
required for dealing with a large number of different kinds
of error is greatly reduced if we can in some way gauge the
seriousness of all of these errors on a common scale. For
example, we may be able to attach an economic value to the
loss, say $w_{ij}$, which is incurred when an individual who in
fact belongs to $\pi_i$ is assigned to $\pi_j$. Presumably
$w_{ii} = 0$ since no error is committed when an individual belonging
to $\pi_i$ is assigned to $\pi_i$, but the theory is flexible
enough to permit $w_{ii} \neq 0$ and to allow the $w_{ij}$ to be either
positive or negative if this is desirable. Here, a negative
"loss" would correspond to a gain. There will be $k^2$ of the
quantities $w_{ij}$, which may be conveniently presented as a
k x k matrix:


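$$W = \begin{pmatrix} w_{11} & w_{12} & \cdots & w_{1k} \\ w_{21} & w_{22} & \cdots & w_{2k} \\ \vdots & \vdots & & \vdots \\ w_{k1} & w_{k2} & \cdots & w_{kk} \end{pmatrix}$$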


This  matrix  is  known  as  the  "loss  matrix,"  and  its  specifi¬ 
cation  is  not  the  task  of  the  statistician  but  depends  on  the 
use  to  be  made  of  the  classification  after  it  has  been  ef¬ 
fected.  (We  may  remark  that  W  corresponds  to  the  "pay-off 
matrix"  of  the  theory  of  games.) 

Certain  special  cases  of  W  are  of  interest.  If  we 
equate the diagonal terms $w_{11}, w_{22}, \ldots, w_{kk}$ to zero, and
give the remaining terms a common (positive) value which we
may take to be 1, the formulation reduces to that considered
in Chapter VIII: no attention is paid to correct classifications,
and all misclassifications are treated alike (von
Mises, 1945). If k = 2, and the diagonal terms are 0, we
obtain the matrix

$$W = \begin{pmatrix} 0 & w_{12} \\ w_{21} & 0 \end{pmatrix}.$$

We may think of $w_{12}$ and $w_{21}$ as giving the relative importance
of the two types of error in a test of a statistical
hypothesis. An interesting illustration of this situation,
applied to an Air Force problem, has been given by Berkson
(1947).


It  should  be  emphasized  that,  in  spite  of  the  great 
flexibility  of  the  present  approach,  it  cannot  be  applied 
to  all  problems.  There  are  situations  in  which  the  different 
errors  are  qualitatively  so  different  that  a  common  scale  can¬ 
not  be  constructed  for  them,  or  an  asymmetry  of  approach  may 
be compelled by the conditions of the problem. We may need
instead  to  adopt  the  typical  method  of  hypothesis  testing,  and 
set  preassigned  bounds  to  the  probabilities  of  certain  of  the 
errors.  A  combination  of  the  loss  and  error-bound  methods 
may  be  needed  for  some  problems. 

The  simplification  inherent  in  the  loss  approach  is  at¬ 
tained  by  the  introduction  of  the  idea  of  risk.  The  risk  is 
simply  the  expected  loss;  that  is,  the  average  loss  which  may 


be  expected  in  long-run  use  of  the  classification  procedure 
being  considered.  Recalling  that  any  rule  for  classifying  an 
individual  into  one  of  k  populations  on  the  basis  of  certain 
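observations $X_1, X_2, \ldots, X_p$ corresponds to a partitioning
of the p-dimensional sample space of the X's into k regions
$R_1, R_2, \ldots, R_k$, let us denote this partitioning by R. The
conditional risk, given that I belongs to $\pi_i$, is then

$$r_i(R) = \sum_{j=1}^{k} w_{ij} P(R_j \mid \pi_i), \quad i = 1, 2, \ldots, k. \tag{1}$$

If a priori probabilities $p_1, p_2, \ldots, p_k$ exist, there will
be an unconditional risk

$$r(R) = \sum_{i=1}^{k} p_i r_i(R).$$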
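
The arithmetic behind these two kinds of risk can be sketched in a few lines of Python. The sketch takes the conditional risk for true population pi_i to be sum_j w_ij P(R_j | pi_i) and the unconditional risk to be the prior-weighted average of the conditional risks. The loss matrix, the assignment probabilities, and the a priori probabilities below are invented for illustration, and the function names are hypothetical.

```python
def conditional_risks(loss, assign_prob):
    """r_i(R) = sum_j w_ij * P(R_j | pi_i) for each true population pi_i."""
    k = len(loss)
    return [sum(loss[i][j] * assign_prob[i][j] for j in range(k))
            for i in range(k)]


def unconditional_risk(priors, loss, assign_prob):
    """r(R) = sum_i p_i * r_i(R), defined when a priori probabilities exist."""
    risks = conditional_risks(loss, assign_prob)
    return sum(p * r for p, r in zip(priors, risks))


# Hypothetical three-population example.
loss = [[0, 1, 4],
        [2, 0, 1],
        [3, 1, 0]]
# Row i holds P(R_1 | pi_i), P(R_2 | pi_i), P(R_3 | pi_i); each row sums to 1.
assign_prob = [[0.8, 0.1, 0.1],
               [0.2, 0.7, 0.1],
               [0.1, 0.3, 0.6]]
priors = [0.5, 0.3, 0.2]

risks = conditional_risks(loss, assign_prob)
overall = unconditional_risk(priors, loss, assign_prob)
```

With these invented figures the conditional risks come to 0.5, 0.5 and 0.6, and the unconditional risk to 0.52.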


We  may  reasonably  take  our  objective  to  be  the  finding 
of  that  classification  rule  R  which  minimizes  the  risk.  In 
the  case  of  a  priori  probabilities,  this  objective  assumes  a 
very  simple  form.  We  seek  that  partition  R  of  the  sample 
space  for  which 

$$r(R) = \sum_{i=1}^{k} \sum_{j=1}^{k} p_i w_{ij} P(R_j \mid \pi_i) = \sum_{j=1}^{k} \left[ \sum_{i=1}^{k} p_i w_{ij} P(R_j \mid \pi_i) \right]$$

is minimum. The solution of this problem is not very different
mathematically from that dealt with by Welch (1939).
If  there  exist  known  probability  density  functions 
$f_i(x_1, \ldots, x_p)$ of the observable variables, for each of
the populations $\pi_i$, we simply compute the k quantities

$$c_j = \sum_{i=1}^{k} p_i w_{ij} f_i(x_1, \ldots, x_p), \quad j = 1, 2, \ldots, k, \tag{2}$$

and assign I to that population $\pi_j$ for which the corresponding
quantity $c_j$ is least. Intuitively, $c_j$ is proportional
to the a posteriori risk sustained when I is
assigned to $\pi_j$, and we assign I to that population for
which the a posteriori risk is least.
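
The rule just described may be sketched as follows, taking the k quantities to be c_j = sum_i p_i w_ij f_i(x) and assigning I to the population with the smallest c_j. The choice of univariate normal densities with unit variance, the equal a priori probabilities, the 0-1 loss matrix, and the names `normal_density` and `bayes_assign` are all assumptions made for the example. Populations are indexed from 0 in the code.

```python
import math


def normal_density(x, mean, sd):
    """Univariate normal density, used here as an assumed form for f_i."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def bayes_assign(x, priors, loss, densities):
    """Return (index j with least c_j, list of c_j), c_j = sum_i p_i w_ij f_i(x)."""
    k = len(priors)
    f = [density(x) for density in densities]
    c = [sum(priors[i] * loss[i][j] * f[i] for i in range(k))
         for j in range(k)]
    best = min(range(k), key=lambda j: c[j])
    return best, c


# Hypothetical setup: three unit-variance normal populations, 0-1 loss.
densities = [lambda x, m=m: normal_density(x, m, 1.0) for m in (0.0, 2.0, 4.0)]
priors = [1 / 3, 1 / 3, 1 / 3]
zero_one = [[0, 1, 1],
            [1, 0, 1],
            [1, 1, 0]]

choice_low, _ = bayes_assign(0.3, priors, zero_one, densities)
choice_mid, _ = bayes_assign(2.2, priors, zero_one, densities)
choice_high, _ = bayes_assign(5.0, priors, zero_one, densities)
```

With equal a priori probabilities and 0-1 loss, the least c_j belongs to the population whose density is greatest at x, so the three observations are assigned to the populations with means 0, 2 and 4 respectively.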

If  no  a  priori  probabilities  are  assumed,  or  if  nothing 
is  known  about  them,  the  problem  is  more  complicated.  The 
individual  I  may  belong  to  any  one  of  the  k  populations, 
and  we  need  to  consider  all  k  of  the  conditional  risks  (1). 

A  natural  extension  of  the  approach  of  von  Mises  (1945)  would 
be  the  following:  find  that  partition  R  for  which  the  maxi¬ 
mum  of  the  conditional  risks  is  minimum.  Such  a  partition  is 
termed  a  minimax  partition.  The  adoption  of  this  definition 
of optimum corresponds to a pessimistic viewpoint: we don't
know anything about the true population of I, and should
guard ourselves against the worst possibility, the performance
of  a  classification  rule  being  judged  by  the  risk  under  the 
least  favorable  contingency. 


A simplification of the problem, which does not lead to
the specification of a unique procedure, but which clarifies
the possibilities, is effected if we introduce the notions
of admissibility and the complete class. A partition R is
said to be inadmissible if there exists some other partition
S for which none of the conditional risks are greater than
they are for R:

$$r_i(S) \le r_i(R), \quad i = 1, 2, \ldots, k;$$

and such that at least one of the conditional risks is less
for S than for R:

$$r_j(S) < r_j(R) \quad \text{for some } j.$$

It is clear that we should not want to use an inadmissible
classification rule, since there is available an alternative
rule which cannot give higher risk and may give lower risk.

If a rule is not inadmissible, it is called admissible, and
a collection of rules which contains all admissible rules is
called a complete class. From the risk point of view, we
need never consider procedures which do not belong to a complete
class. The notion of complete class was introduced in
connection with hypothesis testing by Lehmann (1947), and was
extended by Wald (1947). The concepts of loss, risk, minimax
procedure, admissibility, and complete class play a fundamental
role in the modern theory of statistical decision functions
developed by Abraham Wald (1950). Various theorems relating
to these concepts, for the special case of the k-population
classification problem, may be deduced from general theorems
of Wald (1950), or may be obtained more simply for the special
case. We shall merely state some of the main results.
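
The definition of inadmissibility is easy to mechanize when only a finite list of candidate rules is under consideration. The Python sketch below compares vectors of conditional risks and discards any rule that is dominated by another rule in the list. The four risk vectors and the function names are hypothetical, and a rule that survives here is admissible only relative to this list, which is a much weaker statement than admissibility among all partitions.

```python
def dominates(risks_s, risks_r):
    """True if S is never worse than R and strictly better for some population."""
    never_worse = all(s <= r for s, r in zip(risks_s, risks_r))
    somewhere_better = any(s < r for s, r in zip(risks_s, risks_r))
    return never_worse and somewhere_better


def admissible_among(candidates):
    """Names of rules not dominated by any other rule in the given finite list."""
    return [name for name, risks in candidates.items()
            if not any(dominates(other, risks)
                       for other_name, other in candidates.items()
                       if other_name != name)]


# Hypothetical conditional risk vectors (r_1, r_2, r_3) for four rules.
candidates = {
    "A": (0.20, 0.30, 0.25),
    "B": (0.20, 0.35, 0.25),   # never better than A, worse for pi_2
    "C": (0.10, 0.40, 0.30),
    "D": (0.30, 0.10, 0.30),
}
survivors = admissible_among(candidates)
```

Rule B is discarded because rule A has the same risk for the first and third populations and a smaller risk for the second; no such comparison is possible among A, C and D.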

Even  if  there  are  no  a  priori  probabilities,  we  can  in¬ 
troduce  them  artificially,  and  consider  the  class  of  all 
classification rules obtainable from (2) when we permit
$p_1, \ldots, p_k$ to assume all possible sets of values. The
class of rules so obtained is known as the class of Bayes
solutions, and these constitute a complete class. Under
certain restrictions one can show that all of the Bayes rules
are admissible. The minimax rule turns out to be the Bayes
rule for which the risks are all equal (the so-called "constant
risk Bayes solution").

The  result  of  these  theorems  is  to  give  a  theoretical 
solution  of  the  optimum  classification  problem,  provided  (i) 
the  loss  matrix  W  can  be  specified  in  a  satisfactory  way, 
and (ii) the distribution of the observable variables is
completely  known  within  each  population.  The  same  comments 
could  be  made  here  that  were  made  about  the  von  Mises  re¬ 
sults  in  Chapter  VIII.  In  fact,  the  present  result  special¬ 
izes  to  the  von  Mises  result  when  W  is  appropriately  chosen. 

Even  if  provisos  (i)  and  (ii)  hold,  there  remains  the 
practical  problem  of  the  explicit  determination  of  the  regions 
$R_j$. One may proceed by trial and error, choosing values for
$p_1, p_2, \ldots, p_k$ arbitrarily, evaluating the corresponding
risks, and then correcting the p's to bring the risks
closer to equality. If k = 2, it is usually not hard to
obtain the explicit minimax procedure, but with k = 3
there may already be practical difficulties. There is need
for more work on useful approximations and shortcuts in finding
the minimax regions when k = 3. A start has been made
by Rao (1948c). The problems which arise are rather different,
according as the distributions $f_i$ are discrete or
continuous, and both cases deserve investigation.
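
For k = 2 the trial-and-error correction of the p's can be replaced by a simple bisection, as in the Python sketch below. Everything specific in it is an assumption made for illustration: the two populations are taken to be N(0, 1) and N(d, 1), the losses w12 and w21 are invented, and bisection is substituted for the informal adjustment described in the text. For these densities the Bayes rule with a priori probabilities (p1, 1 - p1) assigns I to pi_1 when x <= t, where t = [log(p1 w12 / ((1 - p1) w21)) + d^2/2] / d.

```python
import math


def normal_cdf(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def bayes_risks(p1, d, w12, w21):
    """Conditional risks and cut point of the Bayes rule for priors (p1, 1 - p1).

    Populations are assumed N(0, 1) and N(d, 1); I goes to pi_1 when x <= t.
    """
    t = (math.log(p1 * w12 / ((1.0 - p1) * w21)) + 0.5 * d * d) / d
    r1 = w12 * (1.0 - normal_cdf(t))   # risk when I belongs to pi_1
    r2 = w21 * normal_cdf(t - d)       # risk when I belongs to pi_2
    return r1, r2, t


def minimax_prior(d, w12, w21, tol=1e-10):
    """Bisect on p1 until the two conditional risks agree."""
    lo, hi = 1e-12, 1.0 - 1e-12
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        r1, r2, _ = bayes_risks(mid, d, w12, w21)
        if r1 > r2:
            lo = mid   # raising p1 moves the cut to the right and lowers r1
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical separation and losses.
p1 = minimax_prior(d=2.0, w12=1.0, w21=3.0)
r1, r2, cut = bayes_risks(p1, 2.0, 1.0, 3.0)
```

Because the loss for sending a member of pi_2 to pi_1 is taken to be three times the reverse loss, the cut point with equal risks falls below the midpoint between the two means. With k = 3 there are two free a priori probabilities and no single cut point, so a one-dimensional search of this kind no longer suffices.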